Introduction
Related works
Data integration
Privacy protection
Preliminary
System architecture
- The RESTful interfaces let the client-side application access the cloud services. They are designed for automatic LDI, whereas manual LDI is performed through GUIs (Graphical User Interfaces).
- The Knowledge Support components let users query the knowledge graph, which captures correlations among entities such as cases, laws, crimes, and penalties extracted from arbitration cases.
- The local database is deployed on the internal server of the arbitral court and is responsible for storing the original arbitration data with private information. In contrast, the cloud database stores the masked arbitration data and the knowledge graph.
Data flows
Legacy data
No | File Name | Page Number |
---|---|---|
1 | Case Acceptance Approval Form | 1–1 |
2 | Arbitration Application Form | 2–2 |
3 | Copy of Applicant’s ID Card | 3–3 |
4 | Identity Certificate of Attorney | 4–4 |
5 | Copy of Respondent’s ID Card | 5–5 |
6 | Copy of Applicant’s ID Card | 6–10 |
7 | Evidence Submitted by Applicant | 11–15 |
8 | Evidence | 16–17 |
9 | Confirmation Info | 18–18 |
10 | Court Record | 19–21 |
11 | Arbitration Award | 22–23 |
12 | Copy of Arbitration Award | 23–24 |
13 | Service Return Receipt | 25–26 |
14 | Appendix | 27–29 |
Cloud database
Manual LDI
- Laborious manual integration. Many legacy documents have long text content, and the integration work requires the staff to understand the documents thoroughly.
- Work efficiency problem. According to the practical experience of Manual LDI, as the integration work progresses, staff may be tempted to disregard the integration rules to reduce their workload, resulting in a poor integration effect.
- Differences in the understanding of LDI rules. It is difficult for staff to fully understand the integration rules, which leads to the failure to achieve the expected data integration effect.
Manual privacy filtering
- Roughness. Manual privacy filtering is coarse and does not consider the trade-off between data quality and privacy filtering. Staff merely process the private data items that match simple filtering rules, without considering a filtering strategy based on the overall data distribution or the impact of privacy filtering on data quality.
- Privacy leakage. Despite strict access controls, privacy breaches may still occur during the privacy filtering process, because staff members inevitably need to access sensitive data.
AI-enabled legacy data integration
Paper document conversion
- Printed. These paper documents are printed copies of electronic versions. Documents of the same type have a unified and standardized format, and the font is clear and recognizable. This type of document can be easily digitized and extracted.
- Handwritten. This kind of document comes from earlier arbitration cases, and its contents were handwritten by the people involved. Because the writers vary, the fonts, format, writing style, and expression can differ significantly within the same document type.
- Mixed. These paper documents mix printed and handwritten contents. They are clearer and more distinguishable than the handwritten ones. They are commonly printed forms and statements that the people involved filled in or extended by hand.
Key information extraction
Well-formatted data extraction
- Fewer recognition errors and wrong words reduce the difficulty of information extraction.
- Clear content structure and boundaries make SGs recognizable. For example, SGs of the arbitration application, such as personal information and application request, are separated by a title line.
- Key information, such as the personal information at the beginning of the arbitration application, is structured as key-value pairs.
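The two properties above (SGs separated by title lines, and key information laid out as key-value pairs) suggest a simple rule-based extractor. The following is a minimal sketch under stated assumptions, not the system's actual implementation: the section titles, the `key: value` line layout, and the function names `split_sections` and `parse_key_values` are all hypothetical.

```python
import re

# Hypothetical section titles; a real form would define its own set.
SECTION_TITLES = ("Personal Information", "Application Request")

def split_sections(text, titles=SECTION_TITLES):
    """Split text into {title: [lines]} using lines that exactly match a title."""
    sections, current = {}, None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in titles:
            current = stripped
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(stripped)
    return sections

# Assumed layout: "key: value", with either an ASCII or a full-width colon.
KEY_VALUE = re.compile(r"^\s*([^:：]+?)\s*[:：]\s*(.+?)\s*$")

def parse_key_values(lines):
    """Parse 'key: value' lines into a dict; lines without a colon are skipped."""
    pairs = {}
    for line in lines:
        match = KEY_VALUE.match(line)
        if match:
            pairs[match.group(1)] = match.group(2)
    return pairs

sample = """Personal Information
Name: Zhang San
Gender: male
Age: 36
Application Request
Return the payment of 87,000.00 yuan.
"""

sections = split_sections(sample)
personal = parse_key_values(sections["Personal Information"])
print(personal)
```

Free-text SGs such as the application request are kept as raw lines here, since they carry no key-value structure to parse.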
Poorly-formatted data extraction
Content partition
Named entity recognition (NER)
Named Entity Group | Named Entities | Train | Validate | Test |
---|---|---|---|---|
Applicant/ Respondent | gender(A_GEN), age(A_AGE), ID(A_ID), occupation(A_OCC), nationality(A_NAT) | 1200 | 200 | 232 |
Court Record and Arbitration Award | attorney(C_ATO), arbitrator(C_ARB), secretary (C_SEC), clerk(C_CLE), law(C_LAW), case code(C_COD), case reason(C_CAS) | 1300 | 150 | 123 |
Evidence Detail | evidence(E_EVI), verified(E_VER), discussion(E_DIS) | 200 | 30 | 30 |
General Entity | date(G_DATE), location(G_LOC), person name(G_PER), organization name(G_ORG), company name(G_COM) | 1800 | 300 | 300 |
Extraction Example |
---|
The applicant, Zhang San [G_NAM], submitted an arbitration application on August 26th, 2021. The applicant is 36 years old [A_AGE], of Han [A_NAT], with an ID card number of 12345xxxx567 [A_ID], and male [A_GEN]. The applicant currently resides in Hunnan District, Shenyang City, Liaoning Province [A_OCC] |
Sales contract dispute [C_CAS] with the case number of Fushun Arbitration Committee 2020 No. 032 [C_COD]. The court session was held on June 8th, 2020 [G_DATE] at the Fushun Arbitration Committee [G_ORG]. Attendees included arbitrator Li Si [C_ARB], secretary Wang Wu [C_SEC], and clerk Zhao Liu [C_CLE] |
The applicant provided evidence: 7 screenshots of WeChat chat records [C_EVI], which prove that the respondent did not fulfill the contract [C_CLE] |
According to Article 22, Article 31 of the Arbitration Law of the People’s Republic of China [C_LAW], Fuwa Heavy Industry Machinery Co., Ltd. [G_COM], to return 87,000.00 yuan (eighty-seven thousand) to the applicant, Jicheng Electric Manufacturing Co., Ltd. [G_COM], within 7 days [G_DATE] |
Data integration
AI-enabled privacy protection
Database privacy protection
No | Id | Age | Zip code | Gender | Case |
---|---|---|---|---|---|
(a) Original data of the PARTICIPANT_INFO table | |||||
1 | 123 | 25 | 110000 | male | contract dispute |
2 | 124 | 29 | 113000 | female | property dispute |
3 | 125 | 35 | 118000 | female | property dispute |
4 | 126 | 36 | 122000 | male | labor dispute |
5 | 127 | 43 | 124000 | male | contract dispute |
6 | 128 | 30 | 113000 | female | contract dispute |
7 | 129 | 27 | 115000 | male | labor dispute |
8 | 130 | 55 | 125000 | male | labor dispute |
(b) t-Closeness anonymous data (t = 0.25, k = 2) | |||||
1 | Null | 2* | 11* | male | contract dispute |
2 | Null | 2* | 11* | female | property dispute |
7 | Null | 2* | 11* | female | labor dispute |
3 | Null | [30, 37] | 1* | male | property dispute |
4 | Null | [30, 37] | 1* | male | labor dispute |
6 | Null | [30, 37] | 1* | female | contract dispute |
5 | Null | ≥ 38 | 12* | male | contract dispute |
8 | Null | ≥ 38 | 12* | male | labor dispute |
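The parameters of table (b) can be checked with a few lines of code. The sketch below makes several assumptions that this section does not state: age and zip code are treated as the quasi-identifiers, the case type is treated as the sensitive attribute, and the distance between distributions is the variational distance, to which the Earth Mover's Distance reduces when all categorical values are equally far apart. The helper names are illustrative only.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Rows of table (b) as (age, zip code, case).
# Assumption: age and zip code are the quasi-identifiers,
# and case is the sensitive attribute.
rows = [
    ("2*", "11*", "contract dispute"),
    ("2*", "11*", "property dispute"),
    ("2*", "11*", "labor dispute"),
    ("[30, 37]", "1*", "property dispute"),
    ("[30, 37]", "1*", "labor dispute"),
    ("[30, 37]", "1*", "contract dispute"),
    (">= 38", "12*", "contract dispute"),
    (">= 38", "12*", "labor dispute"),
]

def distribution(values):
    """Return the exact relative frequency of each value."""
    counts = Counter(values)
    total = len(values)
    return {value: Fraction(count, total) for value, count in counts.items()}

def variational_distance(p, q):
    """Half the L1 distance between two categorical distributions."""
    keys = set(p) | set(q)
    return sum(abs(p.get(key, 0) - q.get(key, 0)) for key in keys) / 2

# Group the sensitive values by equivalence class.
groups = defaultdict(list)
for age, zip_code, case in rows:
    groups[(age, zip_code)].append(case)

overall = distribution([case for _, _, case in rows])
k = min(len(cases) for cases in groups.values())
t = max(variational_distance(distribution(cases), overall)
        for cases in groups.values())

print(f"smallest class size k = {k}, largest distance t = {t}")
```

Under these assumptions the smallest equivalence class has 2 rows and the largest distance is exactly 1/4, which agrees with the stated k = 2 and t = 0.25. Exact fractions are used so that the boundary comparison is not affected by floating-point rounding.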
Text field privacy protection
Table Column | Filtered Example |
---|---|
APPLICATION_ATTACH.desctiption | The respondent (Zhang San [G_NAM] -> Zhang*) borrowed 150,000 yuan from us on (August 15th, 2021 [G_DATE] -> “x year x month x day”). (Fushun Fertilizer Company [G_ORG] -> “* company”) is aware of this matter but has not taken any action |
COURT_RECORD.argue_info | The main point of dispute between the two parties is whether the oral agreement on interest between the applicant (Zhang San [G_NAM] -> Zhang*) and the respondent (Li Si [G_NAM] -> Li*) is valid, as well as whether (Fushun Fertilizer Company [G_ORG] -> “* company”) bears joint liability |
REPLY_BRIEF.confirmed | We acknowledge that on (November 15th, 2021 [G_DATE] -> “x year x month x day”), (Zhang San [G_NAM] -> Zhang*) promised an interest rate of 5% |
EVIDENCE_RECORD.description | This evidence is the loan agreement signed by (Zhang San [G_NAM] -> Zhang*) and the respondent (Li Si [G_NAM] -> Li*) on (November 15th, 2021 [G_DATE] -> “x year x month x day”) in (Heping District [G_LOC] -> xxxx) of (Shenyang [G_LOC] -> xxxx) |
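The filtered examples in the table follow a small set of label-specific masking rules, which can be sketched as below. This is an illustration under assumptions rather than the system's actual filter: the entity list is assumed to come from the NER step, names are assumed to be written surname first and separated by a space, and the function names `mask_entity` and `filter_text` are hypothetical.

```python
def mask_entity(surface, label):
    """Return the masked form of one recognized entity."""
    if label == "G_NAM":
        # Assumption: the first whitespace-separated token is the surname.
        return surface.split()[0] + "*"
    if label == "G_DATE":
        return "x year x month x day"
    if label == "G_ORG":
        return "* company"
    if label == "G_LOC":
        return "xxxx"
    return surface  # labels without a rule are left unchanged

def filter_text(text, entities):
    """Mask every (surface, label) entity found in text.

    Longer surfaces are replaced first so that an entity contained in
    another one does not break the longer match.
    """
    for surface, label in sorted(entities, key=lambda e: len(e[0]), reverse=True):
        text = text.replace(surface, mask_entity(surface, label))
    return text

entities = [
    ("Zhang San", "G_NAM"),
    ("Li Si", "G_NAM"),
    ("November 15th, 2021", "G_DATE"),
    ("Shenyang", "G_LOC"),
]
original = "Zhang San and Li Si signed the agreement on November 15th, 2021 in Shenyang."
masked = filter_text(original, entities)
print(masked)
```

Plain string replacement masks every occurrence of a recognized surface form, so a name that is recognized once is also masked wherever else it appears in the same field.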
Complexity analysis
Functional evaluation
Setup
No | Data type | Labels |
---|---|---|
1 | Arbitration Application Form | applicant_sec, request_sec, case_sec, appendix_sec |
2 | Court Record | information_sec, dispute_sec, confrontation_sec, evidence_sec |
3 | Evidence | information_sec, detail_sec, verified_sec, appendix_sec |
4 | Arbitration Award | start_sec, participant_sec, process_sec, case_sec, opinion_sec, law_sec |
AI models evaluation
Model | Type No | P(%) | R(%) | F1(%) |
---|---|---|---|---|
Classification(type) | 1 | 94.17 | 93.45 | 93.80 |
Classification(type) | 2 | 89.76 | 82.47 | 85.96 |
Classification(type) | 3 | 96.32 | 93.34 | 94.81 |
Classification(type) | 4 | 83.44 | 84.01 | 83.72 |
Classification(type) | avg | 90.92 | 88.31 | 89.57 |
Classification(one) | | 89.21 | 83.42 | 86.22 |
NER | | 82.54 | 78.24 | 80.33 |
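The F1 column is the usual harmonic mean of precision and recall, F1 = 2PR / (P + R). As a minimal sketch, the snippet below recomputes F1 for three rows of the table from their P and R values; the function name `f1_score` and the row selection are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R) pairs copied from the table above.
reported = {
    "Classification(type), type 2": (89.76, 82.47),
    "Classification(one)": (89.21, 83.42),
    "NER": (82.54, 78.24),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
```

For these rows the recomputed values agree with the reported F1 to two decimal places (85.96, 86.22, and 80.33).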
Privacy protection evaluation
Human-AI comparison
Setup
Two competitors
- Manual LDI team. The team consisted of an arbitration expert, a system administrator for Cloud Arbitration Court, and four interns. The former two were responsible for formulating integration rules, and the interns were responsible for performing integration tasks.
- The AI team. The team consisted of two interns who trained the model, wrote scripts, and collected program output.
Correctness of AI results compared with manual results
Accuracy of AI-enabled LDI
Recognition errors
- Failed to extract target information. This type of error typically implies a missing attribute of the integrated data item. For example, name information sometimes appears in handwritten form, which makes it hard for the AI-enabled method to recognize and extract it. Consequently, the “name” field in the PARTICIPANT_INFO table may be empty.
- Integrating wrong data. This type of error occurs more frequently than the previous one and typically means that table fields are set to inaccurate values. In court records, for example, a bias in the model's semantic understanding can cause the applicant to be misidentified as the arbitration agent, so that the applicant's records are imported into the ATTORNEY_INFO table.
Qualitative comparison
Aspects | AI | Manual |
---|---|---|
Error Rate | ||
Pro. | The error rate is relatively stable and does not vary with changes in workload | Integration errors are relatively fewer and smaller |
Con. | Limited by AI model ability, there is a tendency for more integration errors to occur. | As the workload increases, the probability of integration errors occurring also increases. |
Integration Consistency | ||
Pro. | Fixed AI model and program ensure consistency in the effectiveness of integration. | It is more flexible and facilitates rapid adaptation to new integration rules. |
Con. | Lack of flexibility makes it difficult to respond to changes in integration requirements. | Different understandings of integration rules among staff members lead to poor consistency. |
Cost | ||
Pro. | Overall, it saves a significant amount of labor and time. | No outside staff participation is required. |
Con. | Additional computer experts are needed to design and write relevant programs. | More labor cost and time consumption. |
Difficulty of quality management | ||
Pro. | Locating bugs from program output and logs is relatively simple | By communicating with relevant staff, the cause of errors and solutions can be quickly determined |
Con. | Lack of interpretability of the AI model leads to integration results that cannot be explained, and errors cannot be tracked. | Locating errors requires interaction with humans, which is more complex and less predictable. |
Quantitative comparison
Accuracy
- Data table perspective. This perspective primarily focuses on analyzing data integration accuracy (Acc) in the data tables. For example, the APPLICATION table and the PARTICIPANT_INFO table are extracted from documents of the Arbitration Application Form. The Acc is as high as about 0.83 since these documents have a clear format and are relatively short. On the contrary, the COURT_RECORD table and the ARBITRATION_AWARD table place heavy demands on semantic understanding ability. The accuracy decreases to 0.60 ~ 0.73 because the tables have a more complex data format, and the data are extracted from longer paragraphs.
- Source data perspective. Well-Formatted Data has Acc between 0.85 and 0.93, indicating that AI-enabled LDI effectively solves the integration problem on such source data. In contrast, Poorly-Formatted Data only reaches Acc from 0.61 to 0.78, indicating that the extraction accuracy on such source data still has room for improvement.
- Overall perspective. The results in Table 9 demonstrate that AI-enabled LDI can achieve an overall recognition accuracy of 0.67 ~ 0.80. As the threshold increases, the recognition accuracy decreases, with the lowest accuracy of 0.67 occurring when ɛ = 0.8.
Dimension | | Integrated Rows | Integrated Columns | Acc (ɛ = 0.4) | Acc (ɛ = 0.6) | Acc (ɛ = 0.8) |
---|---|---|---|---|---|---|
Data table | APPLICATION | 2000 | 7 | 0.90 | 0.85 | 0.76 |
Data table | PARTICIPANT_INFO | 3623 | 10 | 0.83 | 0.83 | 0.83 |
Data table | COURT_RECORD | 2835 | 6 | 0.73 | 0.68 | 0.60 |
Data table | ARBITRATION_AWARD | 2000 | 6 | 0.76 | 0.70 | 0.65 |
Source Data | Well-Formatted Data | | | 0.93 | 0.87 | 0.85 |
Source Data | Poorly-Formatted Data | | | 0.78 | 0.72 | 0.61 |
Overall | | | | 0.80 | 0.74 | 0.65 |
Time consumption
Integration Type | Stage1: Prepare | Stage2: Integration | Total |
---|---|---|---|
Manual | 64 | 280 | 344 |
AI-enabled | 120 | 20 | 140 |
Saved | -56 | 260 | 204 |
Saved Rate | -84% | 92% | 59% |