Dessy Rondang Monaomi

Named Entity Recognition (NER) aims to identify and classify entities in text, such as names of people, locations, organizations, and other categories. However, developing NER models faces significant challenges, particularly the presence of noisy labels in datasets. These noisy labels can degrade model accuracy and generalization, especially when training data labels are inconsistent or incorrect.

In the context of the Indonesian language, developing NER models presents even more complex challenges. Indonesian is known for its diverse contextual usage and the presence of homonyms, which often cause ambiguity in annotation. This makes it difficult for NER models to accurately learn patterns from training data. In the S&N dataset, 1,112 noisy labels (2.92%) were found in the training data, while the test dataset contained 579 noisy labels (5.46%). These errors fall into several types: LOC entities are frequently mislabeled as ORG or O, and ORG entities are sometimes mistakenly annotated as LOC or O. Such inconsistencies in annotation patterns can significantly reduce model performance if left unaddressed. Manual re-annotation successfully corrected most of these errors; however, this process requires substantial human resources, time, and cost.

To overcome this challenge, this thesis employs the Co-Regularization technique with an inference modification based on soft voting. The Co-Regularization technique involves parallel training of multiple transformer models with a combination of task-specific loss and agreement loss to maintain consistency in predictions across models. The agreement loss helps the models learn more relevant general patterns by reducing dependence on incorrectly labeled data. This approach enables the model to produce more stable and robust predictions, even when dealing with a dataset containing noisy labels. During the inference stage, prediction probabilities from multiple models are combined using the soft-voting method.
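The combined objective described above can be sketched as follows. This is a minimal NumPy illustration, not the thesis's exact formulation: it assumes the agreement term is the mean KL divergence of each model's token-level label distribution from the average distribution across models, weighted by a hypothetical coefficient `lam`.

```python
import numpy as np

def agreement_loss(prob_dists):
    """Mean KL divergence of each model's per-token label distribution
    from the mean distribution across models (one common choice of
    agreement term; the thesis's exact formulation may differ)."""
    probs = np.stack(prob_dists)                # (n_models, n_tokens, n_labels)
    mean = probs.mean(axis=0, keepdims=True)    # consensus distribution
    kl = np.sum(probs * np.log(probs / mean), axis=-1)  # KL(p_i || mean)
    return float(kl.mean())

def total_loss(task_losses, prob_dists, lam=1.0):
    """Combined objective: average task-specific loss plus a weighted
    agreement term that penalizes disagreement between the models."""
    return float(np.mean(task_losses) + lam * agreement_loss(prob_dists))
```

When all models output the same distribution, the agreement term vanishes and the objective reduces to the average task loss; the larger the disagreement, the stronger the regularization pressure toward consensus.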
This method provides higher flexibility in capturing consensus across models and improves prediction accuracy, particularly for long-phrase entities or ambiguous contexts.

The IDNER News 2K dataset was used as the foundation for this research. A preprocessing phase was conducted to prepare the data, followed by feature extraction using embeddings such as FastText and pretrained transformers like IndoBERT and XLM-R. The NER model was then developed using the Co-Regularization approach. Performance was evaluated using precision, recall, and F1-score to measure the effectiveness of the proposed approach. Experimental results show that the Co-Regularization technique improves the F1-score to 0.9603, and the inference modification using soft voting further raises it to 0.9644. The approach proved resilient to datasets with uneven label distributions and diverse annotation quality.

The novelty of this research lies in the integration of the Co-Regularization technique with inference modification based on soft voting. This technique is not only effective in addressing the problem of noisy labels but also offers an innovative approach to handling error patterns in NER predictions, covering ambiguous contexts, category transitions, and long-phrase entities. Although its computational complexity is higher because probabilities must be computed from all models, the aggregated results are more stable and accurate. This approach contributes significantly to the development of robust NER techniques, particularly for languages with limited high-quality datasets. Additionally, the proposed method has the potential to be adapted to other NLP tasks, such as text classification and relation extraction, and opens opportunities for further research on noisy labels across different linguistic domains.
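The soft-voting step at inference can be sketched as below. The label set and function name here are illustrative (the thesis uses the IDNER News 2K tag set); the core idea is simply to average the per-token label probabilities across models before taking the argmax, rather than letting each model vote on its own hard prediction.

```python
import numpy as np

def soft_vote(prob_dists, labels):
    """Soft voting over token-classification outputs: average the
    per-token label probabilities from several models, then pick the
    highest-probability label for each token."""
    avg = np.mean(np.stack(prob_dists), axis=0)   # (n_tokens, n_labels)
    return [labels[i] for i in avg.argmax(axis=-1)]

# Illustrative label set (hypothetical subset of the real tag set).
labels = ["O", "B-LOC", "B-ORG"]

# Two models disagree on a single token: one leans B-LOC, the other
# is more confident in B-ORG; the averaged probabilities decide.
model_1 = np.array([[0.1, 0.6, 0.3]])
model_2 = np.array([[0.1, 0.2, 0.7]])
print(soft_vote([model_1, model_2], labels))
```

Because the averaged distribution retains each model's confidence, a strongly confident model can outweigh a weakly confident one, which is exactly the property that helps on ambiguous contexts and long-phrase entities where hard majority voting would discard that information.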