Published in: Integrating Materials and Manufacturing Innovation, Issue 1/2018

Open Access | 09.01.2018 | Technical Article

A Relation Aware Search Engine for Materials Science

Authors: Sapan Shah, Dhwani Vora, B. P. Gautham, Sreedhar Reddy

Abstract

Knowledge of material properties, microstructure, underlying material composition, and the manufacturing process parameters a material has undergone is of significant interest to materials scientists and engineers. A large amount of information of this nature is available in publications in the form of experimental measurements, simulation results, etc. However, getting to the right information of this kind, relevant to the given problem at hand, is a non-trivial task. First, an engineer has to go through a large collection of documents to select the right ones. Then, the engineer has to scan through these selected documents to extract the relevant pieces of information. Our goal is to help automate some of these steps. Traditional search engines are not of much help here, as they are keyword centric and weak on relation processing. In this paper, we present a domain-specific search engine that processes relations to significantly improve search accuracy. The engine preprocesses material publication repositories to extract entities such as material compositions, material properties, manufacturing processes, process parameters, and their values, and builds an index using these entities and values. The engine then uses this index to process user queries and retrieve relevant publication fragments. It provides a domain-specific query language with relational and logical operators to compose complex queries. We have conducted an experiment on a small library of publications on steel, running searches such as "get the list of publications which have carbon composition between 0.2 and 0.3 and on which tempering is carried out for about 30 to 40 min." We compare the results of our search engine with the results of a keyword-based search engine.

Introduction

Knowledge of products, materials, processes, and process-structure-property relationships, and the ability to exploit this knowledge to systematically guide design space exploration during product or material development, are critical to the success of the Integrated Computational Materials Engineering (ICME) [1] approach. This knowledge comes from a variety of sources. In this paper, we focus on information extraction from materials science and engineering literature: publications, internal company reports, online articles, and so on. Our objective is to mine knowledge about material compositions, properties, processes, and process-structure-property relations that are relevant to a given problem context.
To consider an example, suppose an engineer wants to know which composition of steel should be used to achieve a hardness of 50 Rc and above in a product. Here, the context of interest is a composition of steel that can give hardness ≥ 50 Rc. This requires retrieving publications that are about steel and looking for cases with hardness above 50 Rc. A simple keyword-based search for "hardness and composition and steel" will not be helpful, as it will turn up too many results. We need a more intelligent search mechanism that can answer user queries taking value relations into account, such as [steel and composition and hardness ≥ 50 Rc]. However, the best that can be done to formulate this query using keyword-based search is [steel and composition and hardness and 50 Rc]. There are several obvious issues with this kind of keyword-based search. First, it looks only for the presence of keywords and retrieves all publications where they are present. In the current example, it will retrieve a publication even when the terms hardness and 50 Rc are unrelated and simply present in different parts of the document. This results in the retrieval of many publications that do not match the problem context (precision errors). Second, in the absence of value relations, it cannot handle range queries. For instance, in the current example, it does not retrieve a publication in which the hardness of the final product is 60 Rc (even though this is greater than 50 Rc). This results in the non-retrieval of many publications that actually match the problem context (recall errors).
Suppose, from the results retrieved, the user discovers that a hardening process with temperature 930 °C and a cooling rate during quenching of 100 °C/min on a composition with carbon in the range 0.5–0.8 wt% achieves the required hardness. The user might then want to know what happens to the toughness of the material when subjected to these processing conditions, with a query such as the following: [material = steel & hardness > 50 Rc & composition.carbon = [0.5, 0.8] wt% & hardening [temperature = 930 °C] & quenching [coolingRate = 100 °C/min] & property = toughness]. Again, keyword-based search will not work, and we have to take value relations into account. To the best of our knowledge, such techniques do not currently exist in materials engineering. In the biomedical domain, researchers have developed enhanced information retrieval (IR) techniques which, on top of keyword-based search, provide concept-based (e.g., proteins and genes) article categorization and search filtering, query refinement, and so on (for example, [2]). However, these techniques do not support the kind of value-based retrieval required in materials engineering.
In this paper, we present a system that is capable of supporting value-based queries of the kind discussed above to retrieve information from materials science publications. To support such queries, the system has to first extract information on various entities, values, and the relationships among them. Figure 1 shows the extraction template used in this work. The entities of interest to us are material elemental composition, material properties, and processing conditions (manufacturing processes and their parameters). Similarly, the relationships of interest are composition-element-value, property-value, and process-parameter-value relations. There are several challenges in extracting these entities and relations. Usually, the information is present in multiple forms such as text, tables, graphs, and so on. Figure 2 gives an example of typical text fragments in a materials publication to illustrate how the information appears in textual form. It contains instances of manufacturing processes such as heating, quenching, and tempering, along with their process parameter values. Figure 3 shows a snapshot of a publication [3] in which the elemental composition of the material is presented in the form of a table. Another complexity is that the relations are often not mentioned explicitly; they have to be inferred from the context. In fact, sometimes the entity names may be missing altogether. For instance, it is common to find sentences of the following kind:
The specimen is heated to 1800 °C and held for 4 min prior to water quenching it to room temperature at 100 °C/min.
Here, “100 °C/min” refers to cooling rate, but it is not mentioned explicitly. This has to be inferred from the context and added to the extracted information.
Once the relevant information is extracted, it needs to be stored in a suitable data structure to facilitate information retrieval. We use an inverted index for this purpose. The reason for choosing an inverted index over a more traditional database is that it enables us to leverage the power of full-text search in addition to value-based query processing. Recent advances in inverted indexing support indices on numeric fields and range queries over them. However, this support is limited to fields with point values; fields with value ranges are not supported. This is a problem since sentences such as the following are common in publications [3]: "The holding time was varied in the range of 1–300 s for comparison purposes." This publication should be retrieved when a user searches for any value between 1 and 300 s for holding time. To address this, we devised a scheme where we store lower and upper bounds in separate indices and use set-based operations to process the query.
The rest of the paper is organized as follows. In the “Related Work” section, we review related work from domains such as biomedical and chemical. In the “System Architecture” section, we describe the system architecture and explain various components such as extraction module, indexing module, query processor module, and search module. In the “Experimental Results” section, we present experimental results where we compare the performance of our search engine with that of a keyword-based search engine on a set of prototypical examples. We summarize the paper and discuss future work directions in the “Conclusions and Future Work” section.

Related Work

Information retrieval and extraction are well-studied subjects, and a lot of research has been conducted in these areas [4, 5]. Domain-specific search engines [6] typically employ spiders that crawl the web in a directed fashion to find domain-relevant documents. They then extract characteristic pieces of information (entities and relations) from the collected documents to provide search functionality over the extracted information.
The biomedical community makes extensive use of text mining technology. Common entities of interest include gene and protein names, symptom and disease names, drug names, and so on. The domain offers a rich set of knowledge sources. For instance, the Unified Medical Language System (UMLS) [7] unifies entities from over 100 dictionaries, ontologies, and terminologies and provides semantic relations among them. Entity extraction techniques are then built on top of these knowledge sources; for instance, MetaMap uses UMLS to identify biomedical entities in text. The relation extraction task in this domain mainly focuses on interactions between genes and proteins, proteins and point mutations, proteins and their binding sites, genes and diseases, and so on. Information retrieval tools such as PubMed, GoPubMed, and EBIMed have been built to search the MEDLINE database, which contains more than 22 million references to biomedical and life science journal articles [8]. Researchers in this domain have also developed enhanced IR techniques which, in addition to keyword-based search, provide concept-based article categorization and search filtering, query refinement, and so on (for example, [2]). However, these techniques do not provide support for the kind of value-based retrieval required in materials engineering.
In the chemical domain, scientists often want to search for articles related to a particular chemical, which is expressed as a formula. Domain-specific search engines have been built for this task. For instance, ChemXSeer [9] uses sophisticated entity extraction techniques such as SVMs and CRFs to extract chemical formulae from text and then designs indexing and ranking schemes for the extracted formulae. Recently, the CHEMDNER [10] challenge organized by the BioCreative workshop focused on two tasks: chemical entity mention recognition (CEM task) and chemical document indexing (CDI task). Promising results were obtained for these tasks by systems that combined machine learning techniques with domain-specific rules (for example, [11]). Similarly, Kim et al. [12] discuss a neural network-based approach for extracting material synthesis parameters. However, none of these approaches support complex searches based on combinations of value-relation-based conditions. Li et al. [13, 14] have developed a search engine for the materials science domain. However, they focus only on Chinese material names and chemical formulae. They neither extract other entity types (material property, process, and parameter) nor provide the kind of value constraint-based publication retrieval discussed in this paper.

System Architecture

Figure 4 shows the architecture of our search engine. The engine has an extraction module that extracts entities, values, and entity-value relations from publications stored in a publication repository. A domain knowledge-guided post-processing module helps in resolving ambiguities that arise during extraction. The indexing module then builds an index that maps the extracted entities and values to the document fragments in which they occur. The system has a query processor module that parses a user query and converts it to an equivalent query over the stored index. The search module then processes the index to retrieve the matching publications. The system then displays publication fragments highlighting the matched entities, values, and entity-value relations. The rest of this section describes these modules in detail.

Extraction Module

This module performs information extraction from materials publications. Figure 5 shows the outline of our extraction algorithm. We first describe the NLP techniques used by our algorithm for preprocessing the text content, followed by the algorithms for extraction of entities, values, and entity-value relations. We then describe how domain knowledge helps resolve some of the ambiguities in the extracted entities and relations.
Text Preprocessing Using NLP Techniques
As shown in Fig. 5, our system first converts the publications (PDF files) into textual form using Exegenix [15], a PDF-to-XML conversion tool. It then takes the raw publication text as input and applies tokenization and sentence splitting to convert the text into a sequence of sentences. This is followed by part-of-speech (PoS) tagging, stemming, and dependency parsing. We use Stanford CoreNLP [16] to perform these tasks.
PoS tagging assigns to each token in the text its part of speech, e.g., noun, verb, or adjective. We use the maximum entropy algorithm for PoS tagging, with the Penn Treebank tag set. PoS tags are used as features for information extraction; for instance, a token with a noun PoS tag is more likely to be an entity than a token with a verb tag. A dependency parser analyzes the grammatical structure of sentences, providing dependency relations between the words of a sentence. Figure 6 shows the dependency parse of a sentence. A dependency relation identifies a head word (governor) and the word that modifies it (a modifier or dependent). The type of the dependency relation specifies the grammatical relation between the words. For instance, the nsubjpass relation between the words heated and material specifies that the word material is a passive nominal subject of a clause governed by the word heated. The Stanford typed dependency parser currently supports approximately 50 grammatical relation types. Dependency relations can be used in extraction rules or as features in machine learning-based extraction approaches. As a simple rule for value relation extraction, we can relate a word denoting a value instance to the word denoting its governor. For the example shown in Fig. 6, the rule would relate the parameter value 900 °C with the process heating (derived from heated by entity extraction).
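The pipeline itself uses the Java CoreNLP toolkit. As a minimal sketch of the same preprocessing step, the following uses stanza, the Python interface to Stanford's neural NLP models (an assumption for illustration, not the authors' tooling); note that stanza emits Universal Dependencies labels (e.g., nsubj:pass) rather than the older Stanford typed dependencies (nsubjpass) quoted above.

    # Sketch: dependency preprocessing with stanza (assumed stand-in for CoreNLP)
    import stanza

    nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
    doc = nlp("The material is heated to 900 degrees.")
    for sent in doc.sentences:
        for word in sent.words:
            gov = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
            print(f"{word.deprel}({gov}, {word.text})")
    # A value-relation rule can then link the number token ("900") to the verb
    # governing it ("heated"), from which the process entity is derived.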
Entity Extraction
We use a dictionary-based technique to recognize entities. We are interested in three types of entities:
  • Material composition: We represent composition as a set of element name and value pairs. The list of chemical element names is a closed set which is stored in a dictionary.
  • Material property: The list of property names such as tensile strength and hardness is again a closed set. We have collected property names from a materials science book and collated them in a dictionary.
  • Process and parameter: While the list of processes is not a closed one, it is not difficult to compile processes for known material categories (e.g., steel processes). We consulted catalogs and domain experts to compile such a list and built a dictionary from the list.
We look up these dictionaries¹ to find entity occurrences. The lookup is performed over stemmed tokens of the publication text.
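A minimal sketch of this lookup is shown below; the toy dictionary contents and the choice of the Porter stemmer are assumptions, with the actual entries coming from the dictionaries published at the NIST repository [23].

    # Sketch: dictionary-based entity tagging over stemmed tokens
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    # Toy dictionaries; the real ones are listed in the NIST repository [23].
    DICTIONARIES = {
        "element":  {"carbon", "silicon", "manganese"},
        "property": {stemmer.stem(w) for w in ["hardness", "elongation"]},
        "process":  {stemmer.stem(w) for w in ["tempering", "quenching", "annealing"]},
    }

    def tag_entities(tokens):
        """Return (token, entity_type) pairs found by stemmed dictionary lookup."""
        hits = []
        for tok in tokens:
            stem = stemmer.stem(tok.lower())
            for etype, entries in DICTIONARIES.items():
                if stem in entries or tok.lower() in entries:
                    hits.append((tok, etype))
        return hits

    print(tag_entities("The specimen was tempered to improve hardness".split()))
    # [('tempered', 'process'), ('hardness', 'property')]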
Value Extraction
We use the PoS tagger to find value instances in text. The PoS tagger assigns the tag CD (cardinal number) to all occurrences of numbers such as three, 30, and 1.5. However, not all instances of numbers specify entity values, for instance, figure and table numbers, numbers in numbered lists, numbers citing previous work, and so on. To address this, we only consider those numbers that are followed by a unit. We use dictionary lookup to identify units. Value instances are also commonly present as ranges, e.g., the obtained microstructure exhibits UTS ranging from 1.77 to 2.2 GPa. We use regular expressions over token sequences to extract such ranges, using the TokensRegex API [17] from Stanford. A few examples of range occurrences include "in the range of 100-200 K", "ranging from 1.7 to 2.2 GPa", and "ranges from 100 to 300 MPa". The following is a simplified version of a regular expression to extract these value ranges: [(rang(e|ing) from|range of)] [word:IS_NUM;tag=CD] [word:/-|to/] [ner:value] [ner:unit].
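For illustration, a rough string-level equivalent of this range extraction in plain Python regular expressions is sketched below (an assumption for clarity; the actual system matches token sequences with TokensRegex, not raw strings, and the unit list here is a small sample).

    # Sketch: string-level extraction of value ranges
    import re

    UNIT = r"(?:GPa|MPa|K|°C|min|h|s|wt%|%)"
    NUM = r"\d+(?:\.\d+)?"

    RANGE = re.compile(
        rf"(?:rang(?:e|ing)\s+(?:of|from)\s+)?({NUM})\s*(?:-|–|to)\s*({NUM})\s*({UNIT})")

    m = RANGE.search("the obtained microstructure exhibits UTS ranging from 1.77 to 2.2 GPa")
    if m:
        low, high, unit = m.groups()
        print(low, high, unit)  # 1.77 2.2 GPa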
Entity-Value Relation Extraction
We have developed a rule-based algorithm for entity-value relation extraction from text. The algorithm first extracts entities and values separately from the preprocessed text. It then applies a set of rules to identify, for each value node, the entity node with which the value should be related. Table 1 lists some of these rules. We use regular expressions over dependency graphs to implement these rules. We use the Semgrex API [18] from Stanford for this purpose. The extracted entities and relations from a publication are then passed to the indexing module for further processing.
Table 1
Rules for entity-value relation extraction from text
The rules below measure distance as the number of edges in the path between two nodes in the dependency graph.
1. If there is an entity node E1 in the path from value node V1 to the root node in the dependency graph, create relation {E1, V1}.
2. Find an entity node E2 in the dependency graph such that it has the shortest distance from the value node V2 and the unit associated with V2 is valid for E2; create relation {E2, V2}.
3. If a value node V3 is mentioned in brackets and is preceded by an entity node E3, create relation {E3, V3}. We use the TokensRegex regular expression:
(?$E3 [{ner:/property|process|parameter/}]) [/[|{|(/] (?$V3 [ner:value] ([/−|to/] [ner:value])?) [/]|}|)/]
The rules in Table 1 use the number of edges in the path between two nodes as a measure of distance between them. Rule 1 starts with a value node V1 and recursively traverses the path to the root node looking for a head node E1 that is identified as an entity. Thus it relates a value to the closest entity in the head-modifier relation path. A few sentence fragments where this rule applies are the following: “…tensile strength of 1000 MPa…”; “…heat treated at 300 °C for 30 min…”; “…80 percent cold reduction…”
Rule 2 relates value nodes to entity nodes that are not in head-modifier relation paths. Consider an example sentence from a publication [19]: The yield point of the specimens cooled according to this route achieves 635 MPa. Here, the value 635 MPa is not in head-modifier relation with the property yield point. Rather, they are related via the common ancestor node achieves. Hence, Rule 1 fails to identify this relation. Rule 2 then looks for all entity nodes which are reachable from 635 MPa and finds two nodes: yield point and cooled. It relates 635 MPa with yield point as it is closer to the value node than cooled. Rule 3 extracts value relations where a value node is mentioned in brackets preceded by an entity node. Table 1 shows the regular expression (in TokensRegex syntax) for identifying such relations. A few sentence fragments where this rule applies are as follows: “…hardening and high tensile strength (1000 MPa) with…”; “The extremely high elongations (80-95%) achieved…”
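A minimal sketch of Rules 1 and 2 over a toy dependency graph is given below; the graph encoding (a child-to-governor map) and the helper names are assumptions for illustration, whereas the actual system evaluates Semgrex patterns over CoreNLP parses.

    # Sketch: Rule 1 (entity on the head path) and Rule 2 (nearest entity by
    # undirected edge distance with a valid unit) over a toy dependency graph
    from collections import deque

    def relate(value, head, entities, unit_ok):
        """head maps each word to its governor; entities is the set of entity nodes."""
        # Rule 1: nearest entity on the path from the value node up to the root
        node = value
        while head.get(node) is not None:
            node = head[node]
            if node in entities:
                return (node, value)
        # Rule 2: breadth-first search for the closest entity whose units accept the value
        adj = {}
        for child, gov in head.items():
            adj.setdefault(child, set()).add(gov)
            adj.setdefault(gov, set()).add(child)
        seen, queue = {value}, deque([value])
        while queue:
            node = queue.popleft()
            if node != value and node in entities and unit_ok(node, value):
                return (node, value)
            for nbr in adj.get(node, ()):
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        return None

    # Toy graph for "The yield point ... cooled ... achieves 635 MPa" (see text)
    head = {"yield point": "achieves", "635 MPa": "achieves",
            "specimens": "yield point", "cooled": "specimens"}
    print(relate("635 MPa", head, {"yield point", "cooled"}, lambda e, v: True))
    # ('yield point', '635 MPa'), found by Rule 2 as in the paper's example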
Information Extraction from Tables
Often, values are also presented in tabular form. For instance, in many documents, material compositions and property values are given in tables, while the corresponding processing conditions are described in textual form. We use dictionary-based matching to extract information from tables. We match column headers (after suitable stemming) with dictionary entries to identify property and element names. We use a regular expression pattern to extract values from table cells. The pattern extracts point values as well as value ranges such as 0.32, 0.3–0.5, 1550 to 1700, 980 ± 20, and ≥ 500. The algorithm then relates a column header entity with the values extracted from the corresponding column cells.
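A sketch of such a cell pattern is given below; the exact expression is an assumption, written to cover the cell shapes listed above.

    # Sketch: one regular expression for the table-cell value shapes above
    import re

    NUM = r"[-+]?\d+(?:\.\d+)?"
    CELL = re.compile(
        rf"(?:(?P<op>[<>≤≥])\s*)?(?P<lo>{NUM})"
        rf"(?:\s*(?:-|–|to)\s*(?P<hi>{NUM})|\s*±\s*(?P<tol>{NUM}))?")

    for cell in ["0.32", "0.3-0.5", "1550 to 1700", "980 ± 20", "≥ 500"]:
        groups = CELL.search(cell).groupdict()
        print(cell, "->", {k: v for k, v in groups.items() if v is not None})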

Domain Knowledge-Guided Post-Processing

We have a domain knowledge repository (see Fig. 4) that contains information on processes, process parameters, their units, constraints on their value ranges, and so on. Similar information exists for properties as well. We exploit this knowledge to resolve ambiguities in entity-value relations. For example, let us consider the following sentence from the "Introduction" section:
The specimen is heated to 1800 °C and held for 4 min prior to water quenching it to room temperature at 100 °C/min.
Rule 1 for entity-value relation extraction will associate the value 100 °C/min with the process quenching, which is incorrect since this value is for the parameter cooling rate. The knowledge repository contains information on the quenching process, its parameters, and the units that go with them. By processing this information, we can infer that a value whose unit is °C/min has to be for the parameter cooling rate when it occurs in the context of a quenching process. This information is then used by our algorithm to correctly associate the value 100 with the parameter cooling rate. Similarly, value constraints are also helpful in resolving ambiguities. For example, the value for carbon by weight percentage in steels can vary only in the range of 0 to 2%. This constraint can be used to avoid wrong values being related to carbon percentage in a material composition.
Our algorithm uses the following rules (in order) to associate values with the right entities:
1. Unit constraint: Associate value V with an entity E only when the unit mentioned for V is in the allowable list of units for E.
2. Value constraint: When multiple entities satisfy rule 1, select only those entities for which value V satisfies the value range constraint.
3. If the ambiguity is still not resolved after applying rules 1 and 2, associate value V with all qualifying entities.
Rule 3 will result in some loss of extraction precision, since it associates a value with multiple entities. However, it improves the recall of the overall search system. At the same time, the loss of precision for the search system as a whole is minimal, since most queries have multiple value-based constraints: while a document might incorrectly match an individual constraint, the probability of it incorrectly matching all the constraints is rather low, so such a document is unlikely to be retrieved.
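A minimal sketch of these three rules follows; the knowledge entries shown are illustrative stand-ins for the domain knowledge repository.

    # Sketch: unit- and value-constraint disambiguation (rules 1-3 above)
    KNOWLEDGE = {
        "cooling rate": {"units": {"°C/min", "K/s"}, "range": (0.0, 10000.0)},
        "temperature":  {"units": {"°C", "K"},       "range": (-273.0, 3000.0)},
        "carbon":       {"units": {"wt%", "%"},      "range": (0.0, 2.0)},
    }

    def candidates(value, unit, entities):
        # Rule 1: keep entities whose allowable units include the value's unit
        by_unit = [e for e in entities if unit in KNOWLEDGE[e]["units"]]
        # Rule 2: of those, keep entities whose value range admits the value
        by_range = [e for e in by_unit
                    if KNOWLEDGE[e]["range"][0] <= value <= KNOWLEDGE[e]["range"][1]]
        # Rule 3: if still ambiguous (or empty), keep all qualifying entities
        return by_range or by_unit

    # "... water quenching it to room temperature at 100 °C/min": the unit
    # alone resolves the parameter to cooling rate
    print(candidates(100.0, "°C/min", ["cooling rate", "temperature"]))
    # ['cooling rate']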

Indexing Module

Indexing is performed to optimize the speed and performance of finding relevant documents for a search query. A keyword-based search engine collects, parses, and stores the data in an inverted index data structure. This data structure maps content such as words and numbers to their locations in the set of documents. Modern-day search engines provide useful features such as ranking, phrase queries, proximity queries, range queries, fielded searching, and so on for full-text search. We build our search system on top of keyword-based search to leverage these features. We use Lucene APIs [20, 21] for this purpose.
Lucene internally stores each document as a list of fields. Each field has a name, value, and type (e.g., float, string, integer). Lucene creates an index table for each unique field name. Thus, in our case, index tables are created for each composition element, property, process, and process-parameter name. For instance, an index table exists for each of silicon, yield strength, tempering, and so on. We use the numeric range query feature of Lucene to support value constraint-based queries (such as yield strength < 100 MPa). For each document D1 in a publication repository, we first create a corresponding Lucene document object LD1. Suppose we extract the value 40 MPa for material property yield strength from D1. We create a field <name = yield strength; value = 40; type = float> for this extraction and add it to LD1.
One challenge with this indexing approach is how to index an entity-value relation when the value is specified as a range. Suppose the algorithm extracts the value range 50–60% for the property elongation from document D2. When a user searches for a document with 52% elongation (or any value between 50 and 60), document D2 should be retrieved. This is not possible with a simple, point value-based index, since we do not know which value of elongation should be indexed with D2. We use set-based operations to address this problem. In addition to creating an index for each entity name, we also create two additional indices for value ranges: entity_high and entity_low. The upper value from a value range is then indexed with entity_high and the lower value with entity_low. Continuing with our example, we create two additional fields:
  • F1: name = elongation_high; value = 60; type = Float
  • F2: name = elongation_low; value = 50; type = Float
Thus, the indexing module creates three index tables for each entity name: one index to store point values and two indices to store range values. Suitable set operations can then be formed using these index tables to answer user queries, as explained in the search module later.
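A minimal sketch of this three-index scheme follows, with plain Python dictionaries standing in for Lucene's numeric fields (an assumption for illustration; the actual system uses Lucene's numeric range query support).

    # Sketch: point index plus _low/_high range indices per entity name
    from collections import defaultdict

    index = defaultdict(list)  # field name -> [(doc_id, value), ...]

    def index_point(doc_id, entity, value):
        index[entity].append((doc_id, value))

    def index_range(doc_id, entity, low, high):
        index[entity + "_low"].append((doc_id, low))
        index[entity + "_high"].append((doc_id, high))

    def docs_in(field, lo, hi):
        """Documents whose indexed value for `field` lies in [lo, hi]."""
        return {d for d, v in index[field] if lo <= v <= hi}

    index_point("D1", "yield strength", 40.0)    # point value from document D1
    index_range("D2", "elongation", 50.0, 60.0)  # value range from document D2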

Query Processor Module

We have designed a domain-specific query language to support entity value-based search requests. Text box 1 shows a subset of the grammar of this language, highlighting its key features. The basic unit of a query is an entity value constraint (represented by relationalTerm in the grammar). The language provides Boolean operators to build complex queries over such entity value constraints (the default operator is AND). An entity value constraint may be specified either as a point value constraint or as a range value constraint. The range constraint may be specified either with lower and upper bounds (e.g., [1, 10]) or with relational operators such as >, <, ≥, or ≤. Value constraints on processing conditions are specified with a process name followed by parameter value constraints in square brackets (refer to relationalTerm in the grammar). The language also provides a way to combine value-based queries with keyword-based queries. A few example queries are as follows:
  • carbon [0.2, 0.3] & elongation > 0.4
  • tempering [time > 20 & temperature [200, 400]] & UTS [400, 500]
In the second example, the first part of the query is looking for a tempering process with tempering time > 20 and tempering temperature in the range 200–400.
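To make the query structure concrete, the following toy parser handles flat conjunctions of such constraints (an assumption for illustration; the full grammar in Text box 1 also covers nested process constraints, disjunction, and keyword terms, which this sketch omits).

    # Sketch: parsing flat '&'-joined entity value constraints
    import re

    TERM = re.compile(
        r"(?P<entity>[A-Za-z][A-Za-z ]*?)\s*"
        r"(?:\[\s*(?P<lo>[\d.]+)\s*,\s*(?P<hi>[\d.]+)\s*\]"
        r"|(?P<op>[<>]=?)\s*(?P<val>[\d.]+))")

    def parse(query):
        """Parse each '&'-joined term into its entity and value constraint."""
        terms = []
        for part in query.split("&"):
            groups = TERM.match(part.strip()).groupdict()
            terms.append({k: v for k, v in groups.items() if v is not None})
        return terms

    print(parse("carbon [0.2, 0.3] & elongation > 0.4"))
    # [{'entity': 'carbon', 'lo': '0.2', 'hi': '0.3'},
    #  {'entity': 'elongation', 'op': '>', 'val': '0.4'}]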

Search Module

This module takes a user query written in the language discussed above and translates it into an equivalent query on the Lucene index. Let us look at a simple user query: "elongation [55, 65]". As mentioned earlier, we create three indices for each entity. For elongation, these indices are elongation, elongation_high, and elongation_low. To process the above user query, we generate the following queries on these indices:
1. Query on index elongation: "elongation = [55, 65]". This retrieves all the documents in which the point value of elongation is in the range 55–65.
2. Query on index elongation_high: "elongation_high = [55, 65]". This retrieves all the documents in which the value range of elongation has its upper bound in the range [55, 65]. All these documents satisfy the user query, since their value ranges overlap with the value range of the user query irrespective of their lower bounds.
3. Query on index elongation_low: "elongation_low = [55, 65]". This retrieves all the documents in which the value range of elongation has its lower bound in the range 55–65.
4. Query: "elongation_high = [65, ∞] AND elongation_low = [−∞, 55]". This retrieves those documents in which the value range of elongation has its upper bound greater than 65 and its lower bound less than 55. For example, if we have a document D where the value range of elongation is 40 to 75, D will be retrieved by this query.
We refer to the above queries as Lucene queries. For each value constraint in the user query, multiple Lucene queries are generated. Each Lucene query returns a document set. These sets are then composed using set operations to find the document set for the value constraint. Table 2 lists the mapping from user query to Lucene queries for different types of value constraints. In the table, the “OR” operator is implemented using set union and the “AND” operator is implemented using intersection.
Table 2
Mapping from value constraint query to Lucene query

Value constraint query | Lucene query
entity > value | entity = [value, ∞] OR entity_high = [value, ∞]
entity < value | entity = [−∞, value] OR entity_low = [−∞, value]
entity: [lower, upper] | entity = [lower, upper] OR entity_high = [lower, upper] OR entity_low = [lower, upper] OR (entity_high = [upper, ∞] AND entity_low = [−∞, lower])
entity: value | entity = [value, value] OR entity_high = [value, value] OR entity_low = [value, value] OR (entity_high = [value, ∞] AND entity_low = [−∞, value])
The document sets returned by the individual value constraints are again composed using set operators (intersection for Boolean operator "&" and union for Boolean operator "|") to find the result set for the user query. Figure 7 shows screenshots of our system for an example query. Note that the system has identified 2 h as the tempering time even though "time" is not explicitly mentioned in the sentence. It has also carried out unit conversion to identify that 2 h > 20 min.
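Continuing the indexing sketch above, the Table 2 mapping can be expressed directly with set operations (again an illustrative sketch; the actual system issues Lucene range queries and composes their result sets).

    # Sketch: composing the index queries of Table 2 with set operations,
    # reusing docs_in() and the indices from the indexing sketch
    INF = float("inf")

    def search_range(entity, lo, hi):
        """Documents satisfying entity: [lo, hi] (Table 2, third row)."""
        return (docs_in(entity, lo, hi)
                | docs_in(entity + "_high", lo, hi)
                | docs_in(entity + "_low", lo, hi)
                | (docs_in(entity + "_high", hi, INF)
                   & docs_in(entity + "_low", -INF, lo)))

    def search_gt(entity, value):
        """Documents satisfying entity > value (Table 2, first row)."""
        return docs_in(entity, value, INF) | docs_in(entity + "_high", value, INF)

    # "elongation [55, 65]" retrieves D2, whose indexed range 50-60 overlaps it
    print(search_range("elongation", 55.0, 65.0))  # {'D2'}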

Experimental Results

In this section, we compare our domain-specific search system with keyword-based search in terms of information retrieval accuracy.
We define accuracy in terms of precision, recall, and F1-score.
  • Precision = fraction of retrieved documents that are relevant to the query
  • Recall = fraction of relevant documents that are successfully retrieved
  • F1-score = harmonic mean of precision and recall, i.e., F1 = 2PR/(P + R)
There is a trade-off between precision and recall: if a system is fine-tuned to increase precision, its recall typically decreases, and vice versa. The F1-score combines precision and recall into a single measure and is used to assess the overall accuracy of a system.
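As a quick check of the formula, plugging in the precision and recall later reported for our system in Table 4 recovers its F1-score:

    # F1 as the harmonic mean of precision and recall
    def f1(p, r):
        return 2 * p * r / (p + r)

    print(round(f1(0.9259, 0.5669), 4))  # 0.7032, as in Table 4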

Dataset

For our experimentation, we restricted the focus to steels and selected publications that discuss mechanical properties and heat treatment processes of steels. We downloaded around 6000 publications from various journals such as Journal of Materials Science, Journal of Materials Processing Technology, Materials and Manufacturing Processes, ISIJ International, and Materials Characterization. We then filtered these publications for steel-related content using keyword-based searches such as "steel and heat treatment," "steel and tensile strength," "steel and casting," and so on. From this filtered set, we randomly selected 180 publications for evaluation. The average length of a publication in the evaluation set is 8.67 pages (standard deviation 3.80).

Evaluation

We have used a set of 20 queries to evaluate the system. The complexity of these queries ranges from simple single entity value constraints to combinations of constraints on multiple entity types. We have manually tagged each publication in our evaluation dataset for these queries. This tagged data indicates, for each query q, whether the publication matches the query or not. This information is then used for calculating the precision and recall of our system. Table 3 lists these queries. For each query, the table also lists the number of documents matching the query and compares it with the number of documents retrieved by our system.
Table 3
Queries used for evaluation: the number of documents matching each query compared with the number retrieved by the system

Query | Matching documents | Retrieved documents
Carbon [0.2, 0.5] & Si > 0.25 | 35 | 26
Al: [0.02, 0.04] | 43 | 28
P: [0.01, 0.02] & Cr > 0.23 | 19 | 13
C: [0.05, 0.5] & Si [0.02, 0.1] | 19 | 12
yield strength > 500 & elongation | 31 | 15
elongation > 10 & tensile strength < 1000 | 22 | 11
elongation [10, 50] & tensile strength [500, 1000] | 18 | 7
yield strength > 500 & tensile strength < 1000 | 16 | 13
uts: [500, 1500] & elongation < 70 | 28 | 9
C > 0.1 & tensile strength > 500 | 31 | 23
C [0.1, 0.5] & Si > 0.1 & tensile strength > 500 | 27 | 19
tensile strength > 1000 & C > 0.2 | 14 | 9
ductility < 70 & Si | 13 | 21
cooling [rate > 50] | 17 | 21
heating [time < 15] & annealing | 10 | 22
tempering [temperature > 200] & quenching [temperature < 500] | 9 | 13
hot rolling & Carbon > 0.05 | 58 | 34
cooling [rate > 20] & Si [0.05, 0.5] | 13 | 18
annealing [temperature > 200] & tensile strength > 200 | 21 | 17
The semantics of keyword-based search assume "OR" as the default Boolean operator between query phrases and rely on the ranking mechanism to present the most relevant documents (containing all query phrases) early in the retrieved list, followed by documents that miss some of the phrases. However, for our query list, we want all the phrases to be present in the retrieved publications. Hence, we compare our system with two variants of keyword-based search: query phrases joined with the Boolean operator "AND" (referred to as Keyword-And) and query phrases joined with the Boolean operator "OR" (referred to as Keyword-Or). Let us consider the scenario mentioned in the "Introduction" section, where we are looking for a material with carbon in the range 0.5 to 0.8 that achieves a hardness of 50 Rc and above. In our domain-specific language, this intent is represented by the query "Carbon: [0.5, 0.8] & hardness > 50 Rc". To represent the same query for keyword-based search, we remove the special characters and treat domain entities as phrases. The two variants of keyword-based search are then represented as follows:
  • Keyword-And: Carbon & 0.5 & 0.8 & hardness
  • Keyword-Or: Carbon | 0.5 | 0.8 | hardness
Table 4 compares our system with the keyword-based search variants. As mentioned, the dataset contains publications from the steel domain. For this set, it turns out that at least one of the query phrases (for example, carbon or strength) is present in a large fraction of the publications. This results in the retrieval of a large number of publications for most of the queries by the Keyword-Or system, as evident from the high recall value shown in the table. This system is obviously not useful, since the user has to go through many publications to find the required pieces of information. The Keyword-And system, on the other hand, retrieves a publication only when all the query phrases are present in it. However, there are several issues with this system. First, it retrieves a publication even when the entities and values are unrelated and simply present in different parts of the publication. This results in a large precision error. Second, due to the absence of value relations, it cannot handle range queries. This results in the non-retrieval of many publications that actually match the user intent, and thus in high recall errors. Our system addresses both of these issues. In addition to ensuring that all query phrases are present in a publication, it also checks the value relation constraints and considers a publication only when these constraints are satisfied, resulting in a fairly accurate retrieval system. This is evident from the precision and recall values reported in Table 4.
Table 4
Comparison of our system with the keyword-based search systems

Accuracy measure | Our system | Keyword-And | Keyword-Or
Precision | 0.9259 | 0.3023 | 0.1497
Recall | 0.5669 | 0.1769 | 0.8928
F1-score | 0.7032 | 0.2232 | 0.2564

Conclusions and Future Work

In this paper, we have presented a domain-specific search engine for mining materials science literature. The engine provides a domain-specific query language using which a user can perform value constraint-based queries on materials entities. Results show that it performs significantly better than keyword-based search engines by intelligently processing entity-value relations. We believe such a search engine will be of significant value to the materials science and engineering community.
We are currently working on further improving the accuracy of the engine. Our relation extraction algorithm currently analyzes only sentence-level relations. Many times, relations span multiple sentences; for instance, a publication might explain a few parameters of a process in one sentence and a few others in the subsequent sentence. We are working on co-reference resolution techniques to address this problem. The accuracy of the proposed entity extraction algorithm also depends on the quality of the dictionaries used. We have a complete dictionary for element names (material composition) and a comprehensive list of material properties. However, the dictionary for process and parameter names is not complete. We are currently working on unsupervised approaches for automatic dictionary creation.
In materials science publications or reports, a lot of useful information is also present in the form of graphs and images. To be able to search this information, we should at least be able to extract meta-data about the plot, such as labels of X, Y axes, ranges of values in the X, Y axes, and so on. We are planning to integrate plot digitization techniques to support this functionality.
Data Availability
The dictionaries for material properties, compositions, and processing conditions used by the search engine presented in this article are available at the NIST repository (materialsdata.nist.gov) [23].
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Footnotes
1. These dictionaries are available at the NIST materials data repository [23].
References
1. National Research Council (2008) Integrated Computational Materials Engineering: a transformational discipline for improved competitiveness and national security. The National Academies Press, Washington, D.C.
5. Sarawagi S (2008) Information extraction. Found Trends Databases 1(3):261–377
6. McCallum A, Nigam K, Rennie J, Seymore K (1999) Building domain-specific search engines with machine learning techniques. Proc. AAAI-99 Spring Symposium on Intelligent Agents in Cyberspace
7. Lindberg D, Humphreys B, McCray A (1993) The unified medical language system. Methods Inf Med 32(4):281–291
9. Mitra P, Giles CL, Sun B, Liu Y (2007) ChemXSeer: a digital library and data repository for chemical kinetics. Proceedings of the ACM First Workshop on CyberInfrastructure. ACM, Lisbon, Portugal
13. Yang L, Chang-Jun H, Zhang J-L (2013) Matsearch: a search engine in materials science distributed data-intensive environment. J Internet Technol 14(5):799–806
14. Yang L, Hu C (2013) A new evaluation model to building materials science domain-specific search engine. Fourth International Conference on EIDWT, pp 527–534. Xi'an, Shaanxi, China
17. Chang AX, Manning CD (2014) TokensRegex: defining cascaded regular expressions over tokens. Department of Computer Science, Stanford University, Technical Report
18. Chambers N, Cer D, Grenager T, Hall D, Kiddon C, MacCartney B et al (2007) Learning alignments and leveraging natural logic. Association for Computational Linguistics, Prague, pp 165–170
19. Adamczyk J, Grajcar A (2007) Heat treatment and mechanical properties of low-carbon steel with dual-phase microstructure. J Achiev Mater Manuf Eng 22(1):13–20
20. McCandless M, Hatcher E, Gospodnetic O (2010) Lucene in action, 2nd edn. Manning Publications Co., ISBN 9781933988177
23.
Metadata
Title: A Relation Aware Search Engine for Materials Science
Authors: Sapan Shah, Dhwani Vora, B. P. Gautham, Sreedhar Reddy
Publication date: 09.01.2018
Publisher: Springer International Publishing
Published in: Integrating Materials and Manufacturing Innovation, Issue 1/2018
Print ISSN: 2193-9764 | Electronic ISSN: 2193-9772
DOI: https://doi.org/10.1007/s40192-017-0105-4
