Natural language processing and machine learning in the categorization of scientific papers: a study around “cultural heritage”
DOI:
https://doi.org/10.26512/rici.v16.n1.2023.47537
Keywords:
Machine learning, Natural language processing, Neural network algorithm, Hierarchical clustering algorithm, Cultural heritage
Abstract
This study aims to assess the potential of Natural Language Processing (NLP) and Machine Learning (ML) techniques for the thematic categorization of scientific articles on "cultural heritage" in two situations: one in which the categories are established a priori and one in which they are established a posteriori. It is applied research with quantitative and qualitative results, based on two corpora. The first consists of scientific articles in Portuguese drawn from a thematic Information Science database, manually selected and categorized. The second consists of scientific articles in English retrieved from the Web of Science and categorized automatically through search strategies and Boolean operators. Both corpora were submitted to two categorization test procedures, one using a supervised algorithm and one using an unsupervised algorithm. The results show that, in both cases, the researcher's participation is essential in defining the representativeness of the chosen sample, which in turn affects the precision and accuracy of the algorithms applied. The study highlights the importance of detail and rigor in data pre-processing and in determining sample size. It also emphasizes that, in this study, a larger volume of data alone did not guarantee results that were representative of the domain studied, which points to the need for ongoing multidisciplinary discussion and analysis to verify and readjust the sample parameters.
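The two test procedures described in the abstract can be illustrated with a minimal, dependency-free sketch. The abstract does not detail the pipeline, so everything below is an assumption made for illustration: the four toy documents and their labels are invented, the pre-processing is a bare tokenizer, and a nearest-centroid classifier stands in for the neural network named in the keywords (it is not the authors' method). The unsupervised part is average-linkage agglomerative clustering on cosine similarity, one simple form of hierarchical clustering.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphabetic tokens (a deliberately simple pre-processing step)."""
    return re.findall(r"[a-z]+", text.lower())

def tf_vector(text):
    """Term-frequency vector of a document, stored as a Counter."""
    return Counter(tokenize(text))

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Supervised case: categories are known a priori.
def train_centroids(docs, labels):
    """Sum the term-frequency vectors of each category into one centroid."""
    centroids = {}
    for doc, label in zip(docs, labels):
        centroids.setdefault(label, Counter()).update(tf_vector(doc))
    return centroids

def classify(text, centroids):
    """Assign the category whose centroid is most similar to the text."""
    vec = tf_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Unsupervised case: categories emerge a posteriori.
def agglomerative(docs, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of document indices."""
    vecs = [tf_vector(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def avg_sim(c1, c2):
        total = sum(cosine(vecs[i], vecs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        a, b = max(pairs, key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # merge the most similar pair
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical toy corpus, invented for this sketch.
docs = [
    "museum collection conservation of cultural heritage objects",
    "conservation and restoration of museum heritage collection",
    "archive records management and digital preservation of documents",
    "digital archive preservation of historical records",
]
labels = ["museum", "museum", "archive", "archive"]

centroids = train_centroids(docs, labels)
predicted = classify("restoration of museum objects", centroids)
groups = agglomerative(docs, n_clusters=2)
```

Even in a toy like this, the outcome depends entirely on which documents and labels the researcher puts into the sample, which is the point the abstract makes about representativeness.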
License
Copyright (c) 2023 Ananda Fernanda de Jesus, Maria Lígia Triques, José Eduardo Santarem Segundo, Ana Cristina de Albuquerque
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright Notice
Authors who publish in this journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under the Creative Commons Attribution License 4.0, which allows the work to be shared with acknowledgment of its authorship and initial publication in this journal.
- Authors may enter into separate, additional contractual arrangements for the non-exclusive distribution of the version of the paper published in this journal (e.g., depositing it in an institutional repository or publishing it as a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to distribute their work online (e.g., in institutional repositories or on their website) at any point before or during the editorial process, as this can lead to productive exchanges and increase the impact and citation of the published work.