
Open Access 11.05.2024 | Original Research

Japanese tort-case dataset for rationale-supported legal judgment prediction

Authors: Hiroaki Yamada, Takenobu Tokunaga, Ryutaro Ohara, Akira Tokutsu, Keisuke Takeshita, Mihoko Sumida

Published in: Artificial Intelligence and Law


Abstract

This paper presents the first dataset for Japanese Legal Judgment Prediction (LJP), the Japanese Tort-case Dataset (JTD), which features two tasks: tort prediction and its rationale extraction. The rationale extraction task identifies the arguments the court accepted from those alleged by plaintiffs and defendants, which is a novel task in the field. JTD is constructed from 3477 Japanese Civil Code judgments annotated by 41 legal experts, resulting in 7978 instances with 59,697 alleged arguments from the involved parties. Our baseline experiments show the feasibility of the two proposed tasks, and our error analysis by legal experts identifies sources of errors and suggests future directions for LJP research.

1 Introduction

Legal information processing aims to provide computational aid in legal procedures, including predicting the outcome of a case through legal judgment prediction (LJP). LJP is beneficial to both legal professionals and the general public. It allows them to anticipate litigation outcomes and to act on those expectations, for example through faster conciliation and smoother negotiations, resulting in more efficient legal services. LJP also reduces the cost of legal services and improves access to them. Easier access to justice is important for those with limited or no access to traditional legal services.
LJP has been a longstanding research topic in artificial intelligence and, like other domains, has adopted machine learning (ML) techniques. ML techniques require large datasets for training and evaluating models. Xiao et al. (2018) proposed a dataset of 2.6M Chinese criminal cases annotated with applicable laws, charges, and prison terms. Chalkidis et al. (2019) presented a dataset of 11.5K cases from the European Court of Human Rights, which is designed for violated article detection and case importance prediction. Katz et al. (2017) constructed a dataset of 28K cases from the Supreme Court of the United States. Semo et al. (2022) released an LJP dataset focused on class action cases in the United States. Chalkidis et al. (2022) proposed a collection of datasets to evaluate model performance across different legal tasks, including LJP tasks in English. In contrast, there is no LJP dataset employing real judgment documents in the Japanese jurisdiction. The LJP tasks and their datasets should be designed to reflect differences in jurisdictions. Against this backdrop, we construct the Japanese Tort-case Dataset (JTD), the first LJP dataset for the Japanese jurisdiction.
In JTD, we deal with judgments on civil cases about torts (Civil Code, Art. 709).1 Tort is an important and popular topic in civil cases. Japanese law recognises a tort as a negligent or intentional infringement of rights or legal interests that causes a plaintiff to suffer loss or harm. In modern society, torts play an important role in disputes on the internet, for example, cases of defamation and privacy infringement on social media. In such cases, tort law is often used to determine liability since there is usually no explicit contract between the parties.
Figure 1 shows an overview of our two tasks: Tort Prediction (TP) and Rationale Extraction (RE). A tort case involves two parties: plaintiffs and defendants. Plaintiffs are claimants of the case, arguing that a defendant’s action is a tort, while defendants contest plaintiffs’ arguments. TP predicts whether a tort is affirmed (T, a Boolean value), given undisputed facts (U) and arguments from both parties (P from plaintiffs and D from defendants). Undisputed facts are facts that are not disputed by either party or that are agreed upon by both parties. They provide the LJP model with context to validate the parties’ arguments. The final decision on a tort (T) should be based on the arguments that are accepted by the judge. Thus, the accepted arguments can be considered rationales for the final decision (T). RE identifies the accepted arguments (\(R^P\) for plaintiffs and \(R^D\) for defendants, both sequences of Boolean values, denoting accepted arguments as True) in the parties’ arguments (P and D). To summarise, our tasks take \((U, P, D)\) as input and output \((T, R^{P}, R^{D})\).
Figure 2 shows an example of an instance. There is an undisputed fact (U), four claims from the plaintiff (P), and one claim from the defendant (D). A gold standard for the tort prediction task is false (\(T_{gold}\)), meaning the subject of this instance is not considered a tort. Gold standard labels for the rationale extraction task are \(\{True, True, False, False\}\) for the plaintiff (\(R^{P}_{g}\)) and \(\{False\}\) for the defendant (\(R^{D}_{g}\)).
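To make the task input and output concrete, the following is a minimal Python sketch of how the instance in Fig. 2 could be represented. The class and field names are ours for illustration and are not the released JTD format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Claim:
    party: str   # "plaintiff" or "defendant"
    text: str

@dataclass
class Instance:
    undisputed_facts: List[str]   # U
    claims: List[Claim]           # P and D, in document order
    tort: bool                    # T_gold: tort affirmed or not
    rationales: List[bool]        # gold RE labels, one flag per claim

# The instance sketched in Fig. 2: four plaintiff claims, one defendant claim,
# tort not affirmed, first two plaintiff claims accepted by the court.
example = Instance(
    undisputed_facts=["<undisputed fact U>"],
    claims=[Claim("plaintiff", f"<plaintiff claim {i}>") for i in range(1, 5)]
          + [Claim("defendant", "<defendant claim 1>")],
    tort=False,
    rationales=[True, True, False, False, False],
)
```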
Our main contributions are the following. We propose new tasks for the Japanese LJP, which consist of judicial decision prediction and identification of their rationales. We conducted a large-scale annotation with 41 legal experts. In the annotation, the annotators captured direct causal relations between the court decisions and arguments from the parties, allowing multiple court decisions on multiple subject matters in a case. From the 3477 annotated documents, we built JTD consisting of 7978 tort-related instances. To establish baseline performance with the dataset, we conduct experiments employing hierarchical Transformer architecture and multi-task learning approaches. Moreover, we perform a detailed error analysis by legal experts to identify sources of errors and suggest future research directions. Our dataset will be available for researchers on a website.

2 Related work

ML-based approaches have been popular in LJP research. ML-based systems automatically learn how judges make decisions from a large number of cases (e.g., judgment documents). These models take fact descriptions as input and predict outcomes or relevant laws. European Court of Human Rights cases are popular sources of datasets for LJP (Aletras et al. 2016; Medvedeva et al. 2018; Chalkidis et al. 2019; Valvoda et al. 2023). Galli et al. (2022) constructed a dataset for an outcome prediction task for Italian Value Added Tax decisions. Katz et al. (2017) built a dataset of cases from the Supreme Court of the United States to train their models. Semo et al. (2022) constructed another LJP dataset of class action cases in the United States. LJP on Chinese criminal cases is another major venue for ML-based LJP models (Luo et al. 2017; Zhong et al. 2018; Hu et al. 2018; Long et al. 2019; Xu et al. 2020). A Japanese dataset for legal tasks is available from the Competition on Legal Information Extraction/Entailment (COLIEE) (Rabelo et al. 2020). However, COLIEE is designed for legal entailment and information retrieval on the Japanese bar exam, and its data size is limited.
Due to the lack of a large reliable dataset, Japanese LJP research has hardly employed the ML-based approach. Instead, the symbolic approach has been popular for Japanese LJP. Symbolic systems predict outcomes of legal reasoning with rules and logic (Nitta et al. 1993). Although symbolic systems require human experts’ intervention in development, we can easily interpret their behaviour. PROLEG (Satoh et al. 2010) demonstrated that a logic programming system could work for the Japanese legal system. PROLEG is a legal reasoning system based on Prolog, implementing a decision-making theory used in civil litigation in Japan. However, extracting logical clauses from natural language text remains an open problem (Navas-Loro et al. 2018).
Moreover, in the recent era of rapid socio-economic change, the importance of a type of provision known as a general clause is growing in legal practice; general clauses enable legislators to deal with various unforeseeable situations and to apply the law fairly and appropriately. They do not specify the physical or social facts that must be proven to determine whether specific legal requirements have been met. Instead, they only provide abstract concepts as requirements. The fulfilment of those requirements is determined through a comprehensive assessment or evaluation of the relevant individual facts in concrete cases. Determining their fulfilment can hardly be implemented by rule-based or logic programming-based approaches. On the other hand, the ML-based approach can learn the standards of those requirements from many precedents and perform better. Therefore, it is essential to construct a large-scale dataset of Japanese judgment documents to facilitate the ML-based approach for Japanese LJP tasks. We use real Civil Code judgment documents, specifically tort cases, since the basic rule of torts in the Japanese Civil Code (Art. 709) is a good example of the above-mentioned general clause.
The success of deep learning methods in many areas raises concerns about the lack of explainability (Jacovi and Goldberg 2020) behind their output. LJP outputs are expected to align with justice, and LJP systems should be trusted by society. Therefore, explainability is even more important in the legal domain than in other domains. Even if LJP systems are used as assistant tools for legal consulting, they can affect people’s behaviours and indirectly influence their social status and assets. Thus, an LJP system has to explain the reasons for its predictions. To accommodate these needs, recent LJP studies introduced explanation tasks, including court view generation (Ye et al. 2018), rationale paragraph extraction (Chalkidis et al. 2021), and case feature extraction as rationales (Ferro et al. 2019; Branting et al. 2021). Following the prior work, we design our explanation task as rationale extraction similar to Chalkidis et al. (2021), but our extraction task is at the span level instead of the paragraph level. In addition, our target of rationale extraction is argumentative claims from the parties (e.g., plaintiff’s factual allegations) instead of submitted fact descriptions.

3 Japanese tort-case dataset (JTD)

3.1 Data source

Our data source of judgment documents is the legal database “Hanreihisho2” provided by LIC Co., Ltd. We curated judgment documents from first-instance Civil Code cases in lower courts. We retrieved the documents from the database using the keyword-based queries shown in Table 1. Using query A, we retrieved documents describing general tort cases of defamation, privacy infringement and reputation injury. As tort disputes on the Internet are often discussed in Disclosure of Identification Information of the Sender (DIIS) cases, we also retrieved DIIS cases using query B.
Table 1
Queries used in our document retrieval. Translations in square brackets are ours
Type     Query
Query A  (“名誉 [fame]” OR “プライバシー [privacy]” OR “信用毀損 [damage to credibility]”) AND “不法行為 [tort]” NOT “発信者情報開示 [DIIS]” NOT “地位確認 [status confirmation]” NOT “無効確認 [declaration of nullity]” NOT “商標 [trademark]”
Query B  “発信者情報開示 [DIIS]”
As they might include non-tort cases, we manually excluded the irrelevant documents later in the human annotation process.
When writing a judgment document, judges often comply with a particular guideline for writing civil case judgments (Shihō-kenshū-jo 2020). This leads to high similarity in structure across judgments. The structure starts with the Main Text, briefly describing the final decision, followed by Facts and Reasons, which contains sections for a summary of the case, undisputed facts, arguments from the parties, and the court’s decisions (detailed judicial decisions).
As our target documents describe court cases, the documents can contain personal information, sensitive information of parties or legally protected information such as trade secrets. Leins et al. (2020) sheds light on potential ethical issues in constructing datasets from a sensitive data source like judgment documents. In the Japanese legal system, it is guaranteed that anyone can access judgment documents by law (Code of Civil Procedure, Art. 91), but the parties may request to opt out of giving others access to their judgment documents (Code of Civil Procedure, Art. 92). Therefore, sensitive secrets should not be contained in the database we use. Moreover, the providers of documents and databases pseudonymise the documents before publishing a case in journals or databases.

3.2 Annotation scheme

To obtain a set of tuples (\(U, P, D, T_{g}, R^{P}_{g}, R^{D}_{g}\)) for our task inputs and outputs, we annotate the following information based on the work of Yamada et al. (2022). Here, the subscript g denotes task answers (golds). Annotators extract spans at the character level according to the definitions below. Annotators may find no spans of a given type if there is no corresponding text in a document.
The Court Decisions (CD) span describes a judge’s decision on tort, which should be found in the judicial decision section. The CD span has a Decision (@D) attribute, indicating whether judges affirmed the tort (True) or not (False).
The Claim (CL) span describes important parties’ claims which are relevant to the decision on torts.3 The CL span has two attributes: Accepted (@AC) indicating whether judges accepted the claim (True) or not (False), and Who (@W) indicating their claimant, i.e. one of Plaintiff, Defendant and Other4 (e.g., a third party with interest in the outcome).
The Undisputed Facts (UF) span describes facts that are not disputed by any party. UF spans should be identified in sections other than the judicial decision section. The UF span has no attributes.
A judgment document might have multiple CD spans. Note that CD spans in a document can have different values since they concern different decisions, though the decisions can be related to each other. The annotators associate each non-CD annotated span (CL and UF spans) with its relevant CD span. A non-CD annotated span can be associated with multiple CD spans.
Our annotation scheme follows Chalkidis et al. (2021), but with more precise annotation granularity. First, we aim to annotate each individual subject (CD span) on trial instead of annotating an entire judgment document, which is a more precise task design in terms of simulating judicial decision-making. Also, our scheme allows annotators to extract rationale spans at the character level instead of the paragraph level. Moreover, our scheme captures argumentative claims from the parties (factual allegations) in addition to submitted fact descriptions. While the annotation of the argumentative claims was implemented in previous work (Galli et al. 2022), ours additionally implements labels indicating whether they are accepted by the court or not.

3.3 Annotation procedure

Annotators receive a 16-page guideline and five sample annotated documents with commentaries. The guidelines are available as part of our dataset package. First, the annotators read through a judgment document to understand the argument flow. They discard the document if it does not concern torts. After this screening, the annotators annotate spans (CD, CL and UF) and assign the necessary attributes to each span. Because the information in CL and UF spans is used as input to a model, these spans are not annotated in the judicial decision section. The annotators may refer to the judicial decision section only for assigning attribute @D to CD spans and @AC to CL spans. Finally, the annotators associate every non-CD annotated span with its corresponding CD span. To balance the workload between annotators, we let annotators skip documents containing more than 15 CD spans.

3.4 Annotation study

We assessed the reproducibility of the annotation scheme with five annotators: three lawyers, one law school graduate and one undergraduate in law. We asked them to annotate 25 documents independently.
As our tasks identify meaningful and annotatable units in text, i.e. they are “unitising” tasks, we use Krippendorff’s \(\alpha _U\) (Krippendorff 1995) as the main metric for the inter-annotator agreement (IAA).5 We calculate \(\alpha _U\) using the character offsets of spans and their “labels”. The span types6 are used as labels for the spans, while the attribute values are labels for the attributes. In calculating \(\alpha _U\) for the span association, the label for each non-CD annotated span is its associated CD span. As the boundaries of a CD span might differ between annotators, we merged overlapping spans from different annotators into a single CD span by taking their union.
We observed good agreement overall. The annotators identified 377.4 spans on average from the 25 documents, achieving an \(\alpha _U\) of 0.654 over all span types. The \(\alpha _U\) values for the attributes are 0.629 (@AC), 0.641 (@W) and 0.608 (@D), showing reproducible annotations. The \(\alpha _U\) of the span association shows a reasonable score of 0.430, but it is still lower than that of span extraction due to error propagation from the span extraction.

3.5 Production annotation

To increase the number of instances, we deployed the annotation scheme to more annotators with legal knowledge and experience. Each document was annotated by a single annotator. We qualified annotators through dry-run annotations to maintain annotation quality. As a result, 41 annotators participated in the production annotation: nine professional lawyers, 22 law school graduates, and 10 undergraduates in law. Graduating from law school or passing the preliminary exam is a prerequisite for taking the Japanese Bar exam. All the undergraduates were going to take the Bar exam, and eight of them had already passed the preliminary exam when the annotation started. We also checked whether the annotators complied with the annotation guideline during and after the annotation period and excluded severe violators. During the annotation, the annotators could ask questions and discuss edge cases with other annotators via a text-based communication workspace.

3.6 Dataset construction

We construct JTD from the annotated judgment documents (Fig. 3). JTD consists of a set of tuples (\(U,P,D,T_g,R^P_g,R^D_g\)); we call each an instance. U consists of the text annotated as UF. P and D are sequences of text annotated as CL from the plaintiffs and the defendants, respectively. Their corresponding gold labels for the RE task are \(R^P_g\) and \(R^D_g\). Each of them is a sequence of Boolean flags imported from the @AC attributes. The order of elements in those sequences corresponds to their appearance in the judgment document. \(T_g\) is a single Boolean value taken from the @D attribute annotated on a CD span.
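The following is a minimal sketch of this construction step, assuming a simple in-memory representation of the annotated spans; the class and field names are illustrative and are not the released annotation format. Each CD span yields one instance, collecting the CL and UF spans associated with it.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Span:
    span_type: str                                            # "CD", "CL" or "UF"
    text: str
    attrs: Dict[str, object] = field(default_factory=dict)    # e.g. {"AC": True, "W": "Plaintiff"} or {"D": False}
    linked_cd_ids: List[int] = field(default_factory=list)    # CD spans this span is associated with

def build_instances(cd_spans: Dict[int, Span], other_spans: List[Span]):
    """One instance (U, P, D, T_g, R^P_g, R^D_g) per CD span."""
    instances = []
    for cd_id, cd in cd_spans.items():
        linked = [s for s in other_spans if cd_id in s.linked_cd_ids]
        plaintiff = [s for s in linked if s.span_type == "CL" and s.attrs["W"] == "Plaintiff"]
        defendant = [s for s in linked if s.span_type == "CL" and s.attrs["W"] == "Defendant"]
        instances.append({
            "U": [s.text for s in linked if s.span_type == "UF"],
            "P": [s.text for s in plaintiff],
            "D": [s.text for s in defendant],
            "T_gold": cd.attrs["D"],                        # tort affirmed or not
            "RP_gold": [s.attrs["AC"] for s in plaintiff],  # accepted or not
            "RD_gold": [s.attrs["AC"] for s in defendant],
        })
    return instances
```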
Table 2
Japanese Tort-case Dataset overview
# of docs                3477
avg. instances/doc       2.3
# of instances           7978
# of Claims              59,697
# of Undisputed Facts    10,236
Table 3
Dataset split. We split the dataset according to the number of instances
Split    # of docs    # of instances    # of claims
Dev      329          803               6063
Test     391          811               5945
Train    2757         6364              47,689
All      3477         7978              59,697
Table 4
Label (tort or not) distribution of instances

         True    False    All     True rate (%)
Dev      304     499      803     37.9
Test     381     430      811     47.0
Train    2488    3876     6364    39.1
All      3173    4805     7978    39.8
Table 5
Label (accepted or not) distribution of claims

         True      False     All       True rate (%)
Dev      2956      3107      6063      48.8
Test     3073      2872      5945      51.7
Train    24,391    23,298    47,689    51.1
All      30,420    29,277    59,697    51.0
Table 6
Label (accepted or not) distribution of claims by parties

         Plaintiff                                     Defendant
         True      False     All       True rate (%)   True      False     All       True rate (%)
Dev      1494      1530      3024      49.4            1462      1577      3039      48.1
Test     1765      1342      3107      56.8            1308      1530      2838      46.1
Train    12,960    12,271    25,231    51.4            11,431    11,027    22,458    50.9
All      16,219    15,143    31,362    51.7            14,201    14,134    28,335    50.1
Table 7
Length statistics of claims and undisputed facts

        CL (Plaintiff)    CL (Defendant)    UF
mean    120.8             111.8             129.5
std     219.6             178.2             171.7
min     2.0               2.0               2.0
25%     63.0              59.0              59.0
50%     94.0              89.0              94.0
75%     140.0             133.0             151.0
99%     501.2             417.0             707.0
max     19893.0           16413.0           7835.0
Mean, standard deviation, and percentiles
Table 2 shows an overview of JTD. 3477 documents were annotated, which resulted in 7978 instances. 39.8% of the instances are labelled as True for T (Table 4). Table 5 shows basic statistics for Claims. In total, 59,697 Claims are available for the rationale extraction task. 51.0% of Claims are labelled as accepted (True), and the others as rejected (False). Table 6 gives the numbers by party (plaintiff or defendant). The rate of true-labelled Claims is consistent across parties, at 51.7% for the plaintiff and 50.1% for the defendant. Out of 59,697 Claims, 52.5% are from the plaintiff side, and the others come from the defendant side. Table 7 summarises statistics on the length of Claims and Undisputed Facts. The maximum Claim lengths for the plaintiff and the defendant are likely outliers, given that the 99th percentiles are far lower. Table 3 shows the split of our dataset.

4 Models

To establish an LJP baseline using JTD, we employ a hierarchical Transformer architecture, which achieves competitive performance in various legal NLP tasks (Chalkidis et al. 2022). Our hierarchical Transformer-based models (Fig. 4) are designed to capture both word-level context and span-level context by implementing a span-level Transformer encoder on top of word-level Transformer encoders. We call this model the Inter-Span Transformer (IST). IST takes a sequence of spans as input, where each span corresponds to a claim from the plaintiff or defendant side. Its outputs are a single Boolean flag for tort judgment prediction and a sequence of Boolean flags for rationale extraction. These final outputs are obtained through a linear layer placed just after the output of the span-level encoder. IST accepts party-type embeddings to distinguish between the plaintiff’s claims and the defendant’s claims by assigning different embeddings according to who submitted the claim. IST also considers the positions of claims in inputs via position embeddings. As an auxiliary input, IST takes undisputed facts. To inject features from undisputed facts into every encoded claim, IST independently encodes all the concatenated undisputed facts into a single vector, namely fact embeddings. The input representation to the span-level encoder that corresponds to the n-th claim by a plaintiff is the sum of a claim vector from the word-level encoder, fact embeddings (\(E^{f}\)), party-type embeddings of the plaintiff (\(E^{p}\)), and position embeddings (\(E^{ps}_n\)).
As word-level encoders, we utilise two different types of pretrained BERT (Devlin et al. 2019) models: BERT-base-Japanese (BERTja)7 and Japanese-LegalBERT (JLBERT) (Miyazaki et al. 2022). Both BERTja and JLBERT use the same model architecture as the original BERT-base model, with 12 layers, 768-dimensional hidden states, and 12 attention heads. We feed the vector corresponding to the [CLS] token from the BERT output into a linear layer and use its output as a claim vector. While BERTja is pretrained only on the Japanese Wikipedia corpus, JLBERT is adapted to the Japanese legal domain: it is first pretrained on the Japanese Wikipedia corpus and then further pretrained on Japanese judgment documents from Civil Code cases.
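A minimal PyTorch sketch of the IST forward pass is given below, assuming the Hugging Face transformers API. The pretrained model name, the default layer sizes (taken from Table 8’s search space rather than the tuned values), and the mean pooling used for the TP head are our assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class InterSpanTransformer(nn.Module):
    """Sketch of IST: a span-level Transformer encoder on top of a word-level BERT encoder."""
    def __init__(self, bert_name="cl-tohoku/bert-base-japanese", hidden=768,
                 n_heads=8, ff_dim=256, n_layers=2, max_claims=64):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)   # shared word-level encoder
        self.claim_proj = nn.Linear(hidden, hidden)         # claim vector from the [CLS] output
        self.party_emb = nn.Embedding(2, hidden)            # 0: plaintiff, 1: defendant
        self.pos_emb = nn.Embedding(max_claims, hidden)     # claim position in the input
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads,
                                           dim_feedforward=ff_dim, batch_first=True)
        self.span_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.re_head = nn.Linear(hidden, 2)                  # accepted / rejected per claim
        self.tp_head = nn.Linear(hidden, 2)                  # tort affirmed / not

    def encode_text(self, enc):
        out = self.bert(**enc).last_hidden_state[:, 0]       # [CLS] vectors
        return self.claim_proj(out)

    def forward(self, claim_enc, party_ids, fact_enc=None):
        # claim_enc: tokenised claims of one instance, party_ids: (n_claims,) long tensor
        claims = self.encode_text(claim_enc)                  # (n_claims, hidden)
        pos = torch.arange(claims.size(0), device=claims.device)
        x = claims + self.party_emb(party_ids) + self.pos_emb(pos)
        if fact_enc is not None:                              # fact embeddings E^f from all UF text
            x = x + self.encode_text(fact_enc).mean(dim=0)
        h = self.span_encoder(x.unsqueeze(0)).squeeze(0)      # span-level context
        re_logits = self.re_head(h)                           # per-claim RE prediction
        tp_logits = self.tp_head(h.mean(dim=0))               # pooled for TP (pooling is our assumption)
        return tp_logits, re_logits
```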
Our JTD also uses Japanese judgment documents as the source of the dataset, so we manually checked the number of overlapping documents between the JLBERT pre-training data and the JTD test set. We found that 97.4% of the documents in the JTD test set also appear in the pre-training data. Nevertheless, we decided to employ JLBERT in our experiments for the following three reasons: (1) JLBERT is the only available pretrained language model in the Japanese legal domain; (2) JLBERT was pretrained on a proprietary judgment document corpus, so we cannot pre-train our own version of JLBERT that excludes the documents in the JTD test set; (3) as our dataset and tasks are produced from manually annotated judgment documents, masked language modelling during pre-training cannot leak the answers to our two LJP tasks. Our JTD test set is still unseen data for BERTja-based models, so we can perform a sanity check by conducting the same experiments with BERTja. By comparing BERTja-based and JLBERT-based models, we avoid overestimating the performance of JLBERT-based models and provide informative insight into how domain-adapted models behave in Japanese legal judgment tasks.

4.1 Rationale extraction (RE)

Rationale Extraction task identifies accepted arguments in the parties’ arguments. Inputs are undisputed facts (U) and arguments from both parties (P from plaintiffs and D from defendants). Outputs are two sequences of Boolean values, \(R^P\) for plaintiffs and \(R^D\) for defendants, denoting accepted arguments as True.
The models used for RE are as follows.
  • RE-random: A random baseline that produces predictions based on the ratio of labels in the training set.
  • RE-BERT: A BERT-based binary classifier. It classifies each input claim independently, without considering context across claims. There are two variants, RE-BERTja and RE-JLBERT, depending on the BERT model used.
  • RE-IST: An IST-based classifier. It outputs a sequence of Boolean values for RE, taking a sequence of claims from plaintiffs and defendants as inputs. There are RE-IST-BERTja and RE-IST-JLBERT.

4.2 Tort prediction (TP)

Tort Prediction predicts whether a tort is affirmed (T, a Boolean value), given undisputed facts (U) and arguments from both parties (P and D) as inputs.
The following are models used for TP.
  • TP-random: A random baseline, similar to RE-random.
  • TP-RF-meta: A random forest (Breiman 2001) classifier. This model takes non-textual features as inputs: the year of a case, the court to which a case belongs and the number of claims from each party.8 This baseline shows how well a model can predict court outcomes using meta-level features (a sketch of this feature construction follows the list below).
  • TP-RF-gold: A similar model to TP-RF-meta, but this model uses the acceptance rates of each claim type instead of their number. The acceptance rates are calculated according to the gold labels of the rationale extraction task in JTD. Note that the target task here is TP, and this model is still blind against the gold labels of TP during validation and testing. This baseline provides a milestone for the TP task.
  • TP-RF-cascaded: A classifier similar to TP-RF-gold. It uses the predicted rationales instead of the golds to calculate acceptance rates.
  • TP-IST: An IST-based classifier. It outputs a Boolean value for TP, taking all claims from plaintiffs and defendants as inputs. This model also has two variants: TP-IST-BERTja and TP-IST-JLBERT.
  • TP-IST-gold: An IST-based classifier whose input is only accepted gold claims. This is an IST version of TP-RF-gold. There are TP-IST-gold-BERTja and TP-IST-gold-JLBERT.
  • TP-IST-cascaded: Its architecture is identical to TP-IST-gold. This model takes only predicted rationales by RE as inputs.
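The following sketch illustrates the feature construction behind TP-RF-meta and TP-RF-gold, assuming scikit-learn for the random forest. The field names ("year", "court_id") are hypothetical, and we collapse the per-claim-type acceptance rates (footnote 8) into per-party rates for brevity; the paper’s exact feature encoding may differ.

```python
from sklearn.ensemble import RandomForestClassifier

def meta_features(instance):
    """TP-RF-meta: case year, court, and number of claims per party (field names are illustrative)."""
    return [instance["year"], instance["court_id"],
            len(instance["P"]), len(instance["D"])]

def gold_rate_features(instance):
    """TP-RF-gold: acceptance rates computed from the gold RE labels (simplified to per party)."""
    def rate(flags):
        return sum(flags) / len(flags) if flags else 0.0
    return [rate(instance["RP_gold"]), rate(instance["RD_gold"])]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
# clf.fit([meta_features(i) for i in train_instances],
#         [i["T_gold"] for i in train_instances])
```

TP-RF-cascaded uses the same rate features, but computed from the RE model’s predicted labels instead of the gold labels.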

4.3 Multi-task approach

We also take a multi-task approach with the IST architecture, in which the model learns its parameters jointly for both RE and TP tasks using the combined loss function (1).
$$\begin{aligned} \textrm{Loss} = {\alpha }\textrm{Loss}_\textrm{TP} + {(1 - \alpha )}\textrm{Loss}_\textrm{RE} \end{aligned}$$
(1)
The multi-task IST takes a sequence of claims and outputs a sequence of Boolean predictions for RE and a Boolean value for TP. There are two variants, Multi-IST-BERTja and Multi-IST-JLBERT.
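A minimal sketch of the joint objective in Eq. (1) is shown below, assuming cross-entropy losses for both heads of the IST sketch above; the exact loss formulation is our assumption.

```python
import torch.nn.functional as F

def multitask_loss(tp_logits, tp_gold, re_logits, re_gold, alpha):
    """Eq. (1): a weighted sum of the TP loss and the RE loss."""
    loss_tp = F.cross_entropy(tp_logits.unsqueeze(0), tp_gold.view(1))  # single tort decision
    loss_re = F.cross_entropy(re_logits, re_gold)                       # one label per claim
    return alpha * loss_tp + (1 - alpha) * loss_re
```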

5 Experiments

5.1 Experimental settings

In the training phase of all neural network-based models, we used the AdamW (Loshchilov and Hutter 2019) optimiser with a linear scheduler whose warmup was 10% of the total training steps. Epochs were fixed at 30. The maximum input length of the word-level encoders is 512 tokens. IST models accept up to 64 claims. We chose the best-performing checkpoint according to accuracy scores on the development set. The parameters of the word-level encoders (BERTja and JLBERT) are not frozen but fine-tuned.
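A minimal sketch of this optimiser and scheduler setup, assuming PyTorch’s AdamW and the linear warmup scheduler from Hugging Face transformers; the number of steps per epoch and the learning rate shown are placeholders, not the tuned values.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def configure_optimisation(model, steps_per_epoch, epochs=30, lr=2e-6):
    """AdamW with linear warmup/decay; warmup is 10% of the total training steps."""
    total_steps = epochs * steps_per_epoch
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```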
We employed the optimisation framework Optuna (Akiba et al. 2019) to search for optimal hyperparameters for each neural network-based model. All hyperparameters were tuned using 3,000 instances of the training data. For the RE-BERT models, we performed a grid search to find the optimal learning rate, where the total number of trials was four. For the IST models, we conducted more extensive searches. We utilised the Tree-structured Parzen Estimator algorithm (Bergstra et al. 2011) for IST as their search spaces were much larger than RE-BERT’s. The total number of trials was 112. Table 8 shows the search spaces. The “TRenc” hyperparameters are for the Transformer module implemented as the span-level encoder in the IST. “Use UF” is a flag indicating whether a model utilises fact embeddings or not. \(\alpha\) is the weight in the loss function (Eq. (1)) for the multi-task models.
Table 8
Hyperparameters search space
Parameters               Choices
Learning rate            2e−6, 4e−6, 6e−6, 8e−6
TRenc heads              2, 4, 6, 8
TRenc FF dim             64, 128, 256, 512
TRenc layers             1, 2, 3, 4
Use UF                   True, False
\(\alpha\) (if applicable)    0, 0.05, ..., 0.95
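A minimal Optuna sketch over the Table 8 search space is shown below; train_and_evaluate is a placeholder for training an IST model on the 3,000 sampled training instances and returning its development-set accuracy.

```python
import optuna

def train_and_evaluate(**params):
    # Placeholder: train an IST model with the given hyperparameters
    # on the sampled training instances and return development accuracy.
    raise NotImplementedError

def objective(trial):
    # Search space copied from Table 8
    params = {
        "learning_rate": trial.suggest_categorical("learning_rate", [2e-6, 4e-6, 6e-6, 8e-6]),
        "trenc_heads": trial.suggest_categorical("trenc_heads", [2, 4, 6, 8]),
        "trenc_ff_dim": trial.suggest_categorical("trenc_ff_dim", [64, 128, 256, 512]),
        "trenc_layers": trial.suggest_categorical("trenc_layers", [1, 2, 3, 4]),
        "use_uf": trial.suggest_categorical("use_uf", [True, False]),
        "alpha": trial.suggest_float("alpha", 0.0, 0.95, step=0.05),  # multi-task models only
    }
    return train_and_evaluate(**params)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=112)
```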
We adopt accuracy as the evaluation metric for both RE and TP. We trained and tested each model five times with different random seeds and averaged the scores. We also performed a permutation test to assess the statistical significance of differences between models (significance level at \(p<0.05\), two-tailed test). The target metric of the permutation test is the accuracy score.
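A minimal sketch of such a test is given below, assuming a paired (approximate randomisation) variant that swaps the two systems’ predictions per instance; the paper does not specify the exact variant used.

```python
import numpy as np

def permutation_test(pred_a, pred_b, gold, n_resamples=10000, seed=0):
    """Two-tailed paired permutation test on the accuracy difference of two systems."""
    rng = np.random.default_rng(seed)
    pred_a, pred_b, gold = map(np.asarray, (pred_a, pred_b, gold))
    observed = abs((pred_a == gold).mean() - (pred_b == gold).mean())
    count = 0
    for _ in range(n_resamples):
        swap = rng.random(len(gold)) < 0.5          # randomly swap the two systems per instance
        a = np.where(swap, pred_b, pred_a)
        b = np.where(swap, pred_a, pred_b)
        if abs((a == gold).mean() - (b == gold).mean()) >= observed:
            count += 1
    return (count + 1) / (n_resamples + 1)           # p-value
```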

5.1.1 Cascaded model settings

In the cascaded models, an RE model first predicts claims to be accepted, which are then fed to a TP model. We employ the outputs from the best-performing RE model. To be concrete, we chose Multi-IST-JLBERT and Multi-IST-BERTja for the RE task in the cascaded model. We notate the cascaded models using these RE models as follows: TP-RF-cascaded-BERTja, TP-RF-cascaded-JLBERT, TP-IST-cascaded-BERTja, and TP-IST-cascaded-JLBERT.

5.2 Results

Table 9
Experimental results of RE (accuracy with standard deviation)

                    Claim-level                                     Doc-level
Models              All             Plaintiff       Defendant
RE-random           0.498 (.005)    0.496 (.005)    0.501 (.007)    0.502 (.007)
RE-BERT-BERTja      0.598 (.005)    0.597 (.007)    0.598 (.008)    0.622 (.003)
RE-BERT-JLBERT      0.634 (.005)    0.635 (.009)    0.631 (.006)    0.653 (.006)
RE-IST-BERTja       0.637 (.012)    0.652 (.010)    0.620 (.022)    0.658 (.014)
RE-IST-JLBERT       0.663 (.008)    0.677 (.013)    0.648 (.008)    0.681 (.007)
Multi-IST-BERTja    0.666 (.008)    0.671 (.009)    0.661 (.011)    0.690 (.013)
Multi-IST-JLBERT    0.674 (.009)    0.675 (.007)    0.673 (.014)    0.691 (.005)
Table 10
Experimental results of TP (accuracy with standard deviation)
Models                    Macro avg. (\(\sigma\))
TP-RF-gold                0.880 (.001)
TP-IST-gold-BERTja        0.883 (.005)
TP-IST-gold-JLBERT        0.883 (.009)
TP-random                 0.503 (.014)
TP-RF-meta                0.574 (.006)
TP-IST-BERTja             0.649 (.023)
TP-IST-JLBERT             0.674 (.024)
Multi-IST-BERTja          0.680 (.007)
Multi-IST-JLBERT          0.683 (.020)
TP-RF-cascaded-BERTja     0.660 (.030)
TP-RF-cascaded-JLBERT     0.639 (.012)
TP-IST-cascaded-BERTja    0.673 (.022)
TP-IST-cascaded-JLBERT    0.666 (.013)

5.2.1 Rationale extraction

Table 9 shows experimental results for the RE task. The best-performing scores among the compared models are highlighted in boldface. The “All” column shows accuracy scores calculated on all claims. We use “All” scores as the target metric for the permutation test. The “Plaintiff” and “Defendant” columns show accuracy scores of claims from plaintiff and defendant, respectively. The “Doc-level” column shows accuracy scores at the document level, where the scores are first calculated per document and then averaged over documents.
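For clarity, here is a minimal sketch of how the claim-level and document-level accuracies could be computed under our reading of the description above; this is not the released evaluation code.

```python
def claim_level_accuracy(preds_per_doc, golds_per_doc):
    """Accuracy over all claims pooled across documents ("All" column)."""
    correct = sum(p == g for preds, golds in zip(preds_per_doc, golds_per_doc)
                  for p, g in zip(preds, golds))
    total = sum(len(golds) for golds in golds_per_doc)
    return correct / total

def doc_level_accuracy(preds_per_doc, golds_per_doc):
    """Accuracy computed per document, then averaged over documents ("Doc-level" column)."""
    per_doc = [sum(p == g for p, g in zip(preds, golds)) / len(golds)
               for preds, golds in zip(preds_per_doc, golds_per_doc)]
    return sum(per_doc) / len(per_doc)
```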
According to the “All” scores, IST showed a significant improvement over RE-BERT; capturing the context between input claims helped the task. Comparing BERTja and JLBERT, we find that the model using JLBERT always performs numerically better than the one with BERTja, and the difference is statistically significant except for the pair of multi-task models. The multi-task models (Multi-IST-\(*\)) always showed better accuracy than their corresponding single-task models (RE-IST-\(*\)). The improvement from the multi-task approach is larger when using BERTja than JLBERT; the improvement from RE-IST-BERTja to Multi-IST-BERTja is statistically significant.
When we look at scores by party, the overall trend is the same as for the “All” scores. An exception is the “Plaintiff” column, where RE-IST-JLBERT achieved the best score among all the models.
The “Doc-level” scores share the same trend with “All”.

5.2.2 Tort prediction

Table 10 shows experimental results for the TP task. The best score among the non-gold models is highlighted in boldface. All TP-IST and Multi-IST models were significantly better than TP-RF-meta in accuracy. The gold models (\(*\)-gold-\(*\)) showed approximately 0.88 accuracy, suggesting an upper bound for the TP models given perfect RE output.
The multi-task models showed the best performance in TP as well as RE. The multi-task models are always numerically better than their corresponding single-task models in accuracy, and we observed statistical significance between Multi-IST-BERTja and TP-IST-BERTja.
The cascaded models showed improvement from their corresponding single-task models; however, even the best model (TP-IST-cascaded-BERTja) did not come close to the Multi-IST models.

5.3 Discussion

In both tasks, the multi-task models show the best performance, suggesting they could leverage interaction between the tort judgment and its supporting arguments. This result is aligned with legal experts’ behaviour in interpreting legal cases. They do not simply conclude in a bottom-up manner. Rather, they make inferences moving between consideration of subordinate arguments and the conclusion. We assume the multi-task models that jointly learned both TP and RE could model the legal prediction better.
The best accuracy we achieved in the TP task was 0.683 with JLBERT as a word-level encoder and 0.680 with BERTja. Those scores are far from perfect and still much lower than 0.880, which is the milestone achieved by the gold models. Our scores are comparable with an accuracy score of 0.668 reported in the class action LJP task from the United States (Semo et al. 2022).9
The high accuracy of the gold TP models indicates the importance of RE for TP. The best RE accuracy was 0.674 with JLBERT and 0.666 with BERTja, so there is still considerable room for improvement over those baseline models. JTD implements claim-level rationale extraction for LJP, while previous work employed paragraph-level extraction (Chalkidis et al. 2021). The challenge of JTD stems from the finer-grained targets to be classified, which require more precise inference.
In the RE task, we observed that the IST models (\(*\)-IST-\(*\)) showed higher accuracy for the plaintiff’s claims than for the defendant’s claims. Differences in the number of instances between the plaintiff and the defendant could have caused this result. JTD has more claims from the plaintiff (52.5%) than from the defendant (47.5%) (Table 6).
JLBERT is generally superior to BERTja for the same architecture for both tasks except for the cascaded models. However, the difference is small for the best models (Multi-IST-\(*\)). As our tasks are specific to the legal domain and JLBERT was trained by the judgment documents that are the source of JTD, we expected better performance from the JLBERT-based models. In reality, however, the effectiveness of JLBERT was limited. This indicates that the span-level encoder and its following layers play more important roles than the word-level encoders in our tasks.

6 Error analysis

We conducted a detailed error analysis employing human legal experts on the tort prediction outputs to identify the source of prediction errors.

6.1 Analysis setup

Four experts (the authors of this paper) participated in the analysis. They are all legal professionals, including three professors of Law at Japanese universities and a Japanese lawyer who has eight years of experience.
We consider the outputs from the four best-performing models: TP-IST-BERTja, TP-IST-JLBERT, Multi-IST-BERTja and Multi-IST-JLBERT. We merged the TP outputs by majority voting over the five runs of each model. We analysed instances where all four models failed in prediction, obtaining 139 out of 811 instances from our test set. Due to limited human resources, we randomly selected 60 of the 139 instances. Two of the experts received 20 instances each, and the other two received 10 each. The experts were asked to fill out a form given the same inputs as the models. They were allowed to see the original judgment documents if necessary. There are two major questions in the form.

6.1.1 Human confidence score

The experts were asked to rate how confidently they could make tort predictions with the given input. The score scale is 0, 1, 2, 3, where 3 means a human legal expert can confidently judge whether an instance is a tort or not, 2 means a human legal expert can predict with uncertainty, 1 means a human legal expert can only predict a tendency of its outcome and 0 means even a human legal expert cannot predict its outcome at all.

6.1.2 External knowledge

We hypothesised that a certain type of error occurred because necessary information was missing from the input texts, i.e., legal domain-specific knowledge and facts described in external documents other than the judgment document. We asked the experts to identify what kind of external knowledge was necessary to make a correct prediction. There are four options: “General knowledge”, “Legal knowledge”, “Insufficient inputs”, and “Other”. “Insufficient inputs” is chosen when an expert believes essential information specific to the instance is missing from the input claims or undisputed facts, while “General knowledge” and “Legal knowledge” are chosen when the missing information is independent of the instance. “Other” is chosen when the above three do not fit; for this option, the necessary information is detailed in free text. The experts could choose multiple options from the four.
In addition to the two questions above, the experts could submit their feedback and comments.

6.2 Result

Figure 5 shows the human confidence scores on the 60 instances. There are 23 instances for which even human experts cannot predict the outcome at all, while the experts made only four confident predictions. Scores 0 and 1 account for more than 70% of the analysed instances, which we consider unconfident. The results suggest that the instances wrongly predicted by the models are also difficult for a human legal expert.
Table 11
Missing information for the TP task
Missing information    #instances    (%)
General knowledge      2             3.3
Legal knowledge        15            25.0
Insufficient inputs    35            58.3
Other                  17            28.3
52 out of 60 instances are flagged with one or more types of external knowledge. Table 11 shows the distribution of instances across the missing information types. Only two instances require “General knowledge”. One of them concerns a sender’s information disclosure, where the missing information is that “the same IP address does not necessarily mean that the same person posted the messages”.
The experts found that 15 instances require “Legal knowledge” to predict their outcome correctly. The legal knowledge includes specific requirements of certain laws, civil code procedures, and heuristic knowledge that a legal expert obtains through experience.
The dominant category was “Insufficient inputs” (35 instances), which stems largely from the nature of the judgment document format. Judgment documents often refer to external documents for detailed evidence. In such cases, descriptions and claims about the evidence in the judgment document itself may not be specific. Often, those external documents are not publicly available. Thus, the annotated claims and undisputed facts from the judgment document might be insufficient for confident prediction. Also, in some judgments, concrete descriptions of evidence or even factual claims from the parties can be pseudonymised or censored to protect private and confidential information.
Another cause of insufficient inputs comes from the annotation process, for example, when an annotator failed to extract necessary claims and facts. Our annotation guideline prohibits, in principle, the extraction of claims and facts from the “court’s decision” section, which contains information corresponding to the answer in the TP task (Sect. 3.1). Therefore, the necessary facts were not fully annotated when they were described only in the “court’s decision” section.
Eight out of 60 instances are not flagged with any external knowledge. However, five are rated with a confidence score of 1, suggesting that they are not necessarily easy to predict. The experts reported that those instances were from controversial cases, including a case whose decision was flipped in a higher court. Such a case is considered challenging for a machine.
Many comments and feedback from the experts are on instances extracted from complicated cases with multiple tortious actions to be judged. Our annotation guideline instructs annotators to distinguish one tortious action from the others. When there are many tortious actions in the same document, there are many related claims and facts, and the argumentative relations between them are complicated. For example, a single claim can be related to multiple tortious actions. Thus, an annotator can miss necessary facts or fail to exclude unnecessary claims from an instance, resulting in instances with insufficient inputs. Moreover, even if instances in such cases are annotated perfectly, their outcomes are still challenging to predict because of the tangled argumentative relations between claims.
Counterclaims, which allow defendants to assert a new claim against the plaintiffs within the same lawsuit the plaintiffs initially filed, make instances more complicated. The experts identified three out of 60 instances where counterclaims made the instances difficult. In annotating counterclaim instances, our annotation guideline explicitly instructed annotators to extract the original plaintiffs’ claims as defendant claims and the original defendants’ claims as plaintiff claims, to make the argumentative relations between claims consistent with non-counterclaim instances. Nevertheless, we found the counterclaim instances still confusing to a machine. For example, the counterclaim’s plaintiff was referred to as “defendant” in the input and vice versa. Although these notations were correct for a counterclaim instance, they can confuse a machine since most instances are non-counterclaim.
Another notable piece of feedback concerns instances from medical cases. In those cases, detailed medical procedures and expert opinions, which are crucial to making a legal decision, may be prepared separately from the main judgment documents. As a result, our annotators could not extract the necessary information for such instances. Therefore, these medical cases can be difficult to predict.

7 Limitations and ethical considerations

We clarify limitations and ethical considerations here to prevent misuse of our dataset and misunderstanding of our findings.
The task design for LJP reflects essential elements of the Japanese jurisdiction. However, there are differences from real-world conditions. The inputs in our tasks are only Undisputed Facts, Plaintiff’s Claims, and Defendant’s Claims, which are obtained from judgment documents. In real court cases, there are other documents, including third-party expert opinion letters, detailed evidence documents, and other undisclosed documents (e.g., private information and confidential patent information). They are intermediate outputs or auxiliary inputs in court cases but are still important. This difference limits the model’s capability, as shown by the manual error analysis results (Sect. 6.2). The difference often stems from the limited scope of publicly available information. We suspect such a limitation also applies to other jurisdictions that do not disclose the whole set of court documents. As many LJP datasets are constructed from judgment documents only, we must recognise that the current LJP task design can be limited and reflect only a part of legal decision-making.
The annotation scheme for our dataset was validated through the preliminary experiment with 25 documents annotated by five legal experts. We note that the number of documents used in the preliminary experiment is lower than that of our final dataset, i.e. 3477 documents, and we did not perform dual-coder annotation for the final dataset. Although we took alternative measures (extended tutorials, and annotator selection via dry-run) to ensure the quality of annotation, we acknowledge that the final dataset may contain inevitable human errors.
We also acknowledge that our dataset is limited in its scope. The dataset contains tort-related cases, which are only a part of the whole range of legal judgments in the Japanese civil jurisdiction. Thus, our findings from the experimental results cannot directly apply to general Japanese LJP tasks. However, we believe our baseline results provide informative clues for other civil legal judgment tasks. Another limitation is the quality of the annotations. While we have made an effort to keep the annotation quality as high as possible, there is still a chance of errors and misunderstandings by the annotators. Although the agreement study shows reasonable performance, we plan to continuously update and expand our JTD on a long-term schedule.
Given the limitations, we emphasise that one should not rely solely on a model trained on JTD in one’s legal decision-making. Our recommended use case of the JTD-trained models is legal assistance services, but they should not be fully automated. We recommend using the models with legal professionals such as lawyers10. This human–machine hybrid approach will improve the lead time of case handling and correct potential errors from models’ false predictions.
Moreover, we emphasise that JTD is not intended for developing a judgment system. In other words, we do not intend to replace judges or courts with JTD-trained models. Important intellectual activities of judges include the conceptualisation of new rights and the updating of legal interpretations in response to changing times. The current LJP tasks in JTD do not cover such aspects. Therefore, we argue that only human judges should hold such authority.

8 Conclusion and future work

We proposed a novel dataset for Japanese legal judgment prediction featuring tort cases under the Japanese Civil Code. We specifically targeted two tasks: tort prediction and rationale extraction. This is the first dataset consisting of real Japanese court cases with human expert annotation. Although we still have to continue to improve the annotation quality, our annotation procedure will be a good starting point for producing reliable datasets in the Japanese legal domain.
We conducted a feasibility study of the two tasks with the baseline models. We also compared the performance between single-task and multi-task approaches. The baseline experiments confirmed the feasibility of our tasks, and we found that the multi-task approach performed better. Moreover, we manually analysed the outputs from the IST models with experienced legal specialists to diagnose the errors. The results suggest that there are difficult cases even for human experts.
These novel findings were made possible only by our dataset and the extended expert analysis. We believe our dataset provides a useful resource in legal NLP not just for Japanese LJP research but also for comparative experiments and analysis across different languages and jurisdictions.
In future work, we plan to extend the size of our dataset and increase the types of cases in addition to tort cases. Additionally, adapting our annotation scheme and tasks to other jurisdictions is an interesting direction for further research.

9 Supplementary information

Our JTD dataset is available for non-commercial academic research purposes. We also share the original document set used in JTD construction. All the files are available as “Japanese Tort-case Dataset” at https://www.gsk.or.jp/catalog/gsk2024-a.

Acknowledgements

We appreciate Prof. Souichirou Kozuka at Gakushuin University and Prof. Kazuhiko Yamamoto at Hitotsubashi University for their helpful comments. The judgment document data for this study were provided at no cost by LIC Co., Ltd. solely for academic research purposes. This work was supported by JST RISTEX Grant Number JPMJRX19H3, JST ACT-X Grant Number JPMJAX20AM and the Support Centre for Advanced Telecommunication Technology Research.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices

Appendix A

Prediction examples

Table 12 shows a prediction example. Translations are ours. A sentence (claim) in Japanese might be separated into several sentences in English as a result of translation. The sample was chosen from instances in which a human expert commented that a judicial decision prediction would be difficult.
Table 12
Prediction sample (ID: L07530809-6654)

Party: Plaintiff (RE gold: True, RE pred.: False)
  The defendants requested a loan from the plaintiff around July 2018, stating that “the development of an accounting system using AI is close to being completed, but we do not have enough funds”. In response to this request, the plaintiff concluded the agreement on 30th July 2018 and granted a total of JPY 10 million on three occasions between 30th July and 25th September 2018. However, the defendants had not developed the AI accounting software at all.

Party: Plaintiff (RE gold: True, RE pred.: False)
  The defendants did not develop any AI accounting software. Despite having no intention of using the funds borrowed from the plaintiff for the defendant company’s business or repaying them, they pretended as if they had such intentions and fraudulently obtained money from the plaintiff. Such fraudulent acts by the defendants constitute tortious conduct against the plaintiff.

Party: Defendant (RE gold: False, RE pred.: True)
  There is no fact that the defendants asked the plaintiff for the loan, and the agreement was initially triggered by the plaintiff’s offer to purchase shares of the defendant company for 20 million yen as the plaintiff had made several hundred million yen in profits from the plaintiff’s business. Subsequently, due to the plaintiff’s circumstances, the plaintiff agreed to lend the defendant 10 million yen.

Party: Defendant (RE gold: False, RE pred.: True)
  The defendants are developing AI accounting software, and moreover, under the agreement, the use of the loans was only specified as “operating funds” and not limited to the cost of developing AI accounting software. Thus, there was neither a breach of the agreement by the defendants nor tortious conduct.

TP Gold: True, TP Prediction: False
Footnotes
1
Yamamoto (2019) provides a good overview of Japanese tort law.
 
3
In reality, we annotated two subcategories of CL: Factual Claims (FC) and Claims of Norms (NC); the former includes factual allegations and their opposing fact assertions, while the latter refers to abstract legal arguments regarding torts (e.g., references to precedents from the supreme court).
 
4
We did not consider Other in our later experiments as there were few.
 
5
We used the implementation by Meyer et al. (2014).
 
6
Four span types: CD, UF, NC and FC. Note that NC and FC are the subcategories of CL.
 
8
We actually distinguish Factual Claims and Claims of Norms for the features.
 
9
Semo et al. (2022) works on tasks similar to our TP task. However, many differences still exist, such as languages, jurisdictions, target topics, and legal proceedings. They may affect the outcome of experiments. Thus, this comparison is advisory only.
 
10
As of the paper submission, non-lawyers in Japan are prohibited from engaging in legal services for the purpose of earning compensation; such services are reserved exclusively for licensed lawyers (Attorneys Act, Art. 72).
 
References
Zurück zum Zitat Akiba T, Sano S, Yanase T, Ohta T, Koyama M (2019) Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. KDD ’19, pp. 2623–2631. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3292500.3330701 Akiba T, Sano S, Yanase T, Ohta T, Koyama M (2019) Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. KDD ’19, pp. 2623–2631. Association for Computing Machinery, New York, NY, USA. https://​doi.​org/​10.​1145/​3292500.​3330701
Zurück zum Zitat Chalkidis I, Androutsopoulos I, Aletras N (2019) Neural legal judgment prediction in English. In: Proceedings of the 57th Annual meeting of the association for computational linguistics, pp. 4317–4323. Association for computational linguistics, florence, Italy. https://doi.org/10.18653/v1/P19-1424 Chalkidis I, Androutsopoulos I, Aletras N (2019) Neural legal judgment prediction in English. In: Proceedings of the 57th Annual meeting of the association for computational linguistics, pp. 4317–4323. Association for computational linguistics, florence, Italy. https://​doi.​org/​10.​18653/​v1/​P19-1424
Zurück zum Zitat Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, Volume 1 (Long and Short Papers), pp 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota. https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423 Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, Volume 1 (Long and Short Papers), pp 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota. https://​doi.​org/​10.​18653/​v1/​N19-1423. https://​aclanthology.​org/​N19-1423
Zurück zum Zitat Galli F, Grundler G, Fidelangeli A, Galassi A, Lagioia F, Palmieri E, Ruggeri F, Sartor G, Torroni P (2022) Predicting outcomes of italian VAT decisions. In: Francesconi, E., Borges, G., Sorge, C. (eds.) Legal knowledge and information systems - JURIX 2022: the thirty-fifth annual conference, Saarbrücken, Germany, 14-16 December 2022. Frontiers in artificial intelligence and applications, vol. 362, pp. 188–193. IOS Press, Germany. https://doi.org/10.3233/FAIA220465 Galli F, Grundler G, Fidelangeli A, Galassi A, Lagioia F, Palmieri E, Ruggeri F, Sartor G, Torroni P (2022) Predicting outcomes of italian VAT decisions. In: Francesconi, E., Borges, G., Sorge, C. (eds.) Legal knowledge and information systems - JURIX 2022: the thirty-fifth annual conference, Saarbrücken, Germany, 14-16 December 2022. Frontiers in artificial intelligence and applications, vol. 362, pp. 188–193. IOS Press, Germany. https://​doi.​org/​10.​3233/​FAIA220465
Zurück zum Zitat Hu Z, Li X, Tu C, Liu Z, Sun M (2018) Few-shot charge prediction with discriminative legal attributes. In: Proceedings of the 27th international conference on computational linguistics, pp. 487–498. Association for computational linguistics, Santa Fe, New Mexico, USA. https://aclanthology.org/C18-1041 Hu Z, Li X, Tu C, Liu Z, Sun M (2018) Few-shot charge prediction with discriminative legal attributes. In: Proceedings of the 27th international conference on computational linguistics, pp. 487–498. Association for computational linguistics, Santa Fe, New Mexico, USA. https://​aclanthology.​org/​C18-1041
Krippendorff K (1995) On the reliability of unitizing continuous data. Sociol Methodol 25:47–76
Long S, Tu C, Liu Z, Sun M (2019) Automatic judgment prediction via legal reading comprehension. In: Sun M, Huang X, Ji H, Liu Z, Liu Y (eds) Chinese computational linguistics - 18th China national conference, CCL 2019, Kunming, China, October 18-20, 2019, proceedings. Lecture notes in computer science, vol 11856, pp 558–572. Springer. https://doi.org/10.1007/978-3-030-32381-3_45
Medvedeva M, Vols M, Wieling M (2018) Judicial decisions of the European Court of Human Rights: looking into the crystal ball. In: Proceedings of the conference on empirical legal studies in Europe 2018
Meyer CM, Mieskes M, Stab C, Gurevych I (2014) DKPro Agreement: an open-source Java library for measuring inter-rater agreement. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: system demonstrations, pp 105–109. Dublin City University and Association for Computational Linguistics, Dublin, Ireland. https://aclanthology.org/C14-2023
Miyazaki K, Yamada H, Tokunaga T (2022) Cross-domain analysis on Japanese legal pretrained language models. In: Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pp 274–281. Association for Computational Linguistics, Online only. https://aclanthology.org/2022.findings-aacl.26
Navas-Loro M, Satoh K, Rodríguez-Doncel V (2018) ContractFrames: bridging the gap between natural language and logics in contract law. In: New frontiers in artificial intelligence: JSAI-isAI 2018 workshops, JURISIN, AI-Biz, SKL, LENLS, IDAA, Yokohama, Japan, November 12-14, 2018, revised selected papers, pp 101–114. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-030-31605-1_9
Nitta K, Wong S, Ohtake Y (1993) A computational model for trial reasoning. In: Proceedings of the 4th international conference on artificial intelligence and law, ICAIL '93, pp 20–29. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/158976.158979
Rabelo J, Kim M, Goebel R, Yoshioka M, Kano Y, Satoh K (2020) COLIEE 2020: methods for legal document retrieval and entailment. In: Okazaki N, Yada K, Satoh K, Mineshima K (eds) New frontiers in artificial intelligence - JSAI-isAI 2020 workshops, JURISIN, LENLS 2020 workshops, virtual event, November 15-17, 2020, revised selected papers. Lecture notes in computer science, vol 12758, pp 196–210. Springer. https://doi.org/10.1007/978-3-030-79942-7_13
Satoh K, Asai K, Kogawa T, Kubota M, Nakamura M, Nishigai Y, Shirakawa K, Takano C (2010) PROLEG: an implementation of the presupposed ultimate fact theory of Japanese Civil Code by PROLOG technology. In: Onada T, Bekki D, McCready E (eds) New frontiers in artificial intelligence - JSAI-isAI 2010 workshops, LENLS, JURISIN, AMBN, ISS, Tokyo, Japan, November 18-19, 2010, revised selected papers. Lecture notes in computer science, vol 6797, pp 153–164. Springer. https://doi.org/10.1007/978-3-642-25655-4_14
Shihō-kenshū-jo [The Legal Research and Training Institute] (2020) Minji-hanketsu-kian-no-tebiki (Hotei-ban) [The guide to writing civil judgments (revised version)], 10th edn. Housou-kai, Japan
Xiao C, Zhong H, Guo Z, Tu C, Liu Z, Sun M, Feng Y, Han X, Hu Z, Wang H, Xu J (2018) CAIL2018: a large-scale legal dataset for judgment prediction. CoRR arXiv:1807.02478
Yamada H, Tokunaga T, Ohara R, Takeshita K, Sumida M (2022) Annotation study of Japanese judgments on tort for legal judgment prediction with rationales. In: Proceedings of the thirteenth language resources and evaluation conference, pp 779–790. European Language Resources Association, Marseille, France. https://aclanthology.org/2022.lrec-1.83
Yamamoto K (2019) Basic features of Japanese tort law. Civil law. Jan Sramek Verlag KG, Wien, Austria
Ye H, Jiang X, Luo Z, Chao W (2018) Interpretable charge predictions for criminal cases: learning to generate court views from fact descriptions. In: Proceedings of the 2018 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, vol 1 (long papers), pp 1854–1864. Association for Computational Linguistics, New Orleans, Louisiana. https://doi.org/10.18653/v1/N18-1168. https://aclanthology.org/N18-1168
Metadata
Title
Japanese tort-case dataset for rationale-supported legal judgment prediction
Authors
Hiroaki Yamada
Takenobu Tokunaga
Ryutaro Ohara
Akira Tokutsu
Keisuke Takeshita
Mihoko Sumida
Publication date
11.05.2024
Publisher
Springer Netherlands
Published in
Artificial Intelligence and Law
Print ISSN: 0924-8463
Electronic ISSN: 1572-8382
DOI
https://doi.org/10.1007/s10506-024-09402-0
