8.1 Introduction
8.1.1 History of Component-Based Evaluation in QA
8.1.2 Contributions of NTCIR
| Language pair | Gold standard input | AType accuracy (%) | TM accuracy (%) | RS top 15 (%) | IX top 100 (%) | MRR | Overall top 1 R (%) | Top 1 R+U (%) |
|---|---|---|---|---|---|---|---|---|
| EC | None | 86.5 | 69.3 | 30.5 | 30.0 | 0.130 | 7.5 | 9.5 |
| EC | TM | 86.5 | − | 57.5 | 50.0 | 0.254 | 9.5 | 20.0 |
| EC | TM+AType | − | − | 57.5 | 50.5 | 0.260 | 9.5 | 20.5 |
| EC | TM+AType+RS | − | − | − | 63.0 | 0.489 | 41.0 | 43.0 |
| EJ | None | 93.5 | 72.6 | 44.5 | 31.5 | 0.116 | 10.0 | 12.5 |
| EJ | TM | 93.5 | − | 67.0 | 41.5 | 0.154 | 9.5 | 15.0 |
| EJ | TM+AType | − | − | 68.0 | 45.0 | 0.164 | 10.0 | 15.5 |
| EJ | TM+AType+RS | − | − | − | 51.5 | 0.381 | 32.0 | 32.5 |
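The MRR column in the table above is the mean reciprocal rank: for each topic, the reciprocal of the rank of the first correct answer (0 if none is returned), averaged over all topics. A minimal sketch, with illustrative data only:

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """MRR: average over topics of 1/rank of the first relevant item,
    counting 0 for topics where no relevant item is returned."""
    total = 0.0
    for topic_id, results in ranked_lists.items():
        gold = relevant.get(topic_id, set())
        rr = 0.0
        for rank, item in enumerate(results, start=1):
            if item in gold:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)

# Illustrative data: relevant answer at rank 2 for T1, rank 1 for T2.
runs = {"T1": ["a", "b", "c"], "T2": ["x", "y"]}
gold = {"T1": {"b"}, "T2": {"x"}}
print(mean_reciprocal_rank(runs, gold))  # (0.5 + 1.0) / 2 = 0.75
```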
8.2 Component-Based Evaluation in NTCIR
8.2.1 Shared Data Schema and Tracks
- Topic format: The organizer distributes topics in this format as formal run input to IR4QA and CCLQA systems.
- Question Analysis format: CCLQA participants who choose to share Question Analysis results submit their data in this format. IR4QA participants can accept task input in this format.
- IR4QA submission format: IR4QA participants submit results in this format.
- CCLQA submission format: CCLQA participants submit results in this format.
- Gold-standard format: The organizer distributes CCLQA gold-standard data in this format.
- Question Analysis Track: Question Analysis results contain key terms and answer types extracted from the input question. These data are submitted by CCLQA participants and released to IR4QA participants.
- CCLQA Main Track: For each topic, a system returned a list of system responses (i.e., answers to the question), and human assessors evaluated them. Participants submitted a maximum of three runs for each language pair.
- IR4QA+CCLQA Collaboration Track (obligatory): Using the possibly relevant documents retrieved by IR4QA participants, a CCLQA system generated QA results in the same format used in the main track. Since we encouraged participants to compare multiple IR4QA results, we did not restrict the maximum number of collaboration runs submitted, and we used automatic measures to evaluate the results. In the obligatory collaboration track, only the top 50 documents returned by each IR4QA system for each question were utilized.
- IR4QA+CCLQA Collaboration Track (optional): This collaboration track was identical to the obligatory collaboration track, except that participants were able to use the full list of IR4QA results available for each question (up to 1000 documents per topic).
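The official schemas are defined by the track organizers; purely as an illustration of what the shared Question Analysis data carries (key terms and answer types per topic), one could model it like this. All field names and the sample topic ID below are assumptions, not the official format:

```python
from dataclasses import dataclass, field

@dataclass
class QuestionAnalysis:
    """Hypothetical model of the shared Question Analysis data:
    key terms and expected answer types extracted from one topic."""
    topic_id: str
    question: str
    key_terms: list = field(default_factory=list)
    answer_types: list = field(default_factory=list)

# Illustrative instance (topic ID and content are made up).
qa = QuestionAnalysis(
    topic_id="TOPIC-0001",
    question="Who founded the company X?",
    key_terms=["founded", "company X"],
    answer_types=["PERSON"],
)
print(qa.answer_types[0])  # PERSON
```

A record like this is what CCLQA participants would share and IR4QA participants would consume as task input.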
8.2.2 Shared Evaluation Metrics and Process
8.2.2.1 Human-in-the-loop Evaluation Metrics
8.2.2.2 Automatic Evaluation Metrics
| Algorithm | Token | Per-run (N = 40) Pearson | Per-run (N = 40) Kendall | Per-topic (N = 40 × 100) Pearson | Per-topic (N = 40 × 100) Kendall |
|---|---|---|---|---|---|
| Exact match | Char | 0.4490 | 0.2364 | 0.5272 | 0.4054 |
| Soft match | Char | 0.6300 | 0.3479 | 0.6383 | 0.4230 |
| Binarized | Char | 0.7382 | 0.4506 | 0.6758 | 0.5228 |
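The Pearson and Kendall columns measure how closely each automatic metric's scores track the official human-assessed scores, per run and per topic. A self-contained sketch of both correlations, using made-up automatic and human scores (the Kendall version is simple tau-a and assumes no tied pairs):

```python
from math import sqrt
from itertools import combinations

def pearson(xs, ys):
    """Pearson linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs; no tie handling."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / pairs

# Illustrative: automatic metric scores vs. human scores for five runs.
auto = [0.10, 0.25, 0.40, 0.30, 0.55]
human = [0.12, 0.20, 0.45, 0.48, 0.50]
print(round(kendall_tau(auto, human), 3))  # 0.8 (9 concordant, 1 discordant pair)
print(round(pearson(auto, human), 3))
```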
8.3 Recent Developments in Component Evaluation
8.3.1 Open Advancement of Question Answering
To support this vision of shared modules, dataflows, and evaluation measures, an open collaboration will include a shared logical architecture—a formal API definition for the processing modules in the QA system, and the data objects passed between them. For any given configuration of components, standardized metrics can be applied to the outputs of each module and the end-to-end system to automatically capture system performance at the micro and macro level for each test or evaluation. (Ferrucci et al. 2009b)
A group of eight universities followed these principles in collaborating with IBM Research to develop the Watson system for the Jeopardy! challenge (Andrews 2011). The Watson system utilized a shared, modular architecture which allowed the exploration of many different implementations of question-answering components. In particular, hundreds of components were evaluated as part of an answer-scoring ensemble that was used to select Watson's final answer for each clue (Ferrucci et al. 2010).

By designing and building a shared infrastructure for system integration and evaluation, we can reduce the cost of interoperation and accelerate the pace of innovation. A shared logical architecture also reduces the overall cost to deploy distributed parallel computing models to reduce research cycle time and improve run-time response. (Ferrucci et al. 2009b)
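A shared logical architecture of the kind Ferrucci et al. describe amounts to fixing a common interface for each processing stage, so that alternative implementations are interchangeable and each module's output can be measured in isolation. A minimal sketch in Python; the class and method names here are illustrative, not the actual OAQA or UIMA API:

```python
from abc import ABC, abstractmethod

class QAComponent(ABC):
    """Hypothetical shared API: every module consumes and produces a
    common data object, so implementations can be swapped freely."""
    @abstractmethod
    def process(self, data: dict) -> dict: ...

class KeytermExtractor(QAComponent):
    """Toy implementation of one stage: pick out longer question tokens."""
    def process(self, data):
        data["key_terms"] = [t for t in data["question"].split() if len(t) > 3]
        return data

class Pipeline:
    """Runs any configuration of components over the shared data object."""
    def __init__(self, components):
        self.components = components
    def run(self, data):
        for c in self.components:
            data = c.process(data)  # per-module metrics could be logged here
        return data

result = Pipeline([KeytermExtractor()]).run({"question": "Who invented the telephone"})
print(result["key_terms"])  # ['invented', 'telephone']
```

Because every stage honors the same contract, standardized metrics can be attached at each `process` boundary, which is exactly the micro-level measurement the quoted passage calls for.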
8.3.2 Configuration Space Exploration (CSE)
- How can we formally define a configuration space to capture the various ways of configuring resources, components, and parameter values to produce a working solution? Can we give a formal characterization of the problem of finding an optimal configuration from a given configuration space?
- Is it possible to develop task-independent open-source software that can easily create a standard task framework, incorporate existing tools, and efficiently explore a configuration space using distributed computing?
- Given a real-world information processing task, e.g., biomedical question answering, and a set of available resources, algorithms, and toolkits, is it possible to write a descriptor for the configuration space, and then find an optimal configuration in that space using the CSE framework?
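The notion of a configuration space in these questions can be made concrete: it is the Cartesian product of the per-stage choices, and the CSE framework searches it for the configuration that maximizes a benchmark metric such as DocMAP. A toy enumeration, with stage and option names loosely modeled on the component table in this section (they are illustrative, not a real CSE descriptor):

```python
from itertools import product

# Hypothetical per-stage options; one configuration = one choice per stage.
space = {
    "query_expansion": ["none", "UMLS", "MeSH"],
    "retrieval": ["indri"],
    "reranking": ["none", "term_proximity", "sentence_importance"],
}

# Enumerate the full configuration space as the Cartesian product.
configs = [dict(zip(space, choice)) for choice in product(*space.values())]
print(len(configs))  # 3 * 1 * 3 = 9 configurations

# In CSE, each configuration would be executed on a benchmark and scored
# (e.g., by DocMAP); distributed execution makes exhaustive or guided
# exploration of much larger spaces tractable.
```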
| Category | Components |
|---|---|
| NLP tools | LingPipe HMM-based tokenizer; LingPipe HMM-based POS tagger; LingPipe HMM-based named entity recognizer; rule-based lexical variant generator |
| KBs | UMLS for syn/acronym expansion; EntrezGene for syn/acronym expansion; MeSH for syn/acronym expansion |
| Retrieval tools | Indri system |
| Reranking algorithms | Important sentence identification; term proximity-based ranking; score combination of different retrieval units; overlapping passage resolution |
8.3.3 Component Evaluation for Biomedical QA
| | TREC 2006 | CSE |
|---|---|---|
| No. components | 1,000 | 12 |
| No. configurations | 1,000 | 32 |
| No. traces | 92 | 2,700 |
| No. executions | 1,000 | 190,680 |
| Capacity (hours) | N/A | 24 |
| DocMAP max | 0.5439 | 0.5648 |
| DocMAP median | 0.3083 | 0.4770 |
| DocMAP min | 0.0198 | 0.1087 |
| PsgMAP max | 0.1486 | 0.1773 |
| PsgMAP median | 0.0345 | 0.1603 |
| PsgMAP min | 0.0007 | 0.0311 |
| | TREC 2007 | CSE |
|---|---|---|
| DocMAP max | 0.3286 | 0.3144 |
| DocMAP median | 0.1897 | 0.2480 |
| DocMAP min | 0.0329 | 0.2067 |
| PsgMAP max | 0.0976 | 0.0984 |
| PsgMAP median | 0.0565 | 0.0763 |
| PsgMAP min | 0.0029 | 0.0412 |