Treatment of text extracted from digital books for search engine indexing

Glauber José Vaz; Pedro Henrique Rodrigues da Cunha da Veiga; Rafael Gomes Caldas; Wyviane Carlos Lima Vidal; Cristiane Pereira de Assis; Jorge Luiz Correa; Maria Fernanda Moura

doi:10.26512/rici.v16.n2.2023.42740

Authors

Glauber José Vaz, Embrapa Agricultura Digital, Campinas, SP, Brasil. ORCID: https://orcid.org/0000-0002-4527-5150
Pedro Henrique Rodrigues da Cunha da Veiga, IZagro, Franca, SP, Brasil. ORCID: https://orcid.org/0000-0001-8913-9129
Rafael Gomes Caldas, IZagro, Franca, SP, Brasil. ORCID: https://orcid.org/0000-0001-6837-455X
Wyviane Carlos Lima Vidal, Embrapa Agroenergia, Brasília, DF, Brasil. ORCID: https://orcid.org/0000-0002-0994-0117
Cristiane Pereira de Assis, Embrapa Sede, Superintendência de Comunicação, Brasília, DF, Brasil. ORCID: https://orcid.org/0000-0003-2963-5125
Jorge Luiz Correa, Embrapa Agricultura Digital, Campinas, SP, Brasil. ORCID: https://orcid.org/0000-0003-2336-0933
Maria Fernanda Moura, Embrapa Agricultura Digital, Campinas, SP, Brasil. ORCID: https://orcid.org/0000-0001-9334-2832

Keywords:

Digital curation, Information retrieval, Text processing, Dissemination of information, Indexing, Digital books

Abstract

This article presents a methodology for treating text extracted from the digital books of Embrapa's 500 Questions 500 Answers Collection so that their content can be indexed and accessed through a search engine. The methodology involves extracting the essential elements of the books, such as images and HTML files; pre-processing them; analyzing and editing them; and building components suitable for indexing. In addition to a large amount of human analysis, the technologies used are the EPUB format for digital books, the Sigil editor, scripts for text processing, web representation standards, and Elasticsearch. The results show that the method can provide well-formatted texts for indexing and use in search engines, giving a rich user experience and enabling the construction of new digital solutions. Digital curation of this kind is therefore essential for adding value to digital resources and meeting specific user needs.
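The extraction and indexing steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' scripts: the function names, the document fields (`book`, `source`, `text`), and the index name are hypothetical, and the sketch skips the human analysis and Sigil editing that the methodology relies on. It assumes only that an EPUB file is a ZIP archive containing (X)HTML content documents, and it serializes the extracted text in the newline-delimited JSON format accepted by the Elasticsearch bulk API.

```python
import json
import zipfile
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an (X)HTML document, skipping head, script and style."""

    SKIPPED = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_documents(epub_file, book_id):
    """Return one dict per (X)HTML file found in the EPUB archive (hypothetical schema)."""
    docs = []
    with zipfile.ZipFile(epub_file) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue  # images, stylesheets and metadata are ignored here
            parser = TextExtractor()
            parser.feed(archive.read(name).decode("utf-8"))
            parser.close()
            # Collapse runs of whitespace left over from the markup layout.
            text = " ".join(" ".join(parser.parts).split())
            if text:
                docs.append({"book": book_id, "source": name, "text": text})
    return docs


def to_bulk_ndjson(docs, index_name):
    """Serialize documents as action/source line pairs for the Elasticsearch bulk API."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

For example, `to_bulk_ndjson(extract_documents("book.epub", "book-1"), "books")` would produce a payload that could be sent to a cluster's `_bulk` endpoint; the file name and identifiers here are placeholders. Indexing one document per content file is a simplification, since a question-and-answer collection would more plausibly be split into one document per question.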

Author Biographies

Glauber José Vaz, Embrapa Agricultura Digital, Campinas, SP, Brasil

Glauber José Vaz received his bachelor's degree in Computer Science from the Federal University of Uberlândia in 2000 and his master's degree from the Universidade Estadual de Campinas (Unicamp) in 2003. From 2003 to 2010, he taught computing courses at three institutions, including Unicamp. Since 2010, he has worked in research, development, and innovation in computing applied to agriculture at the Digital Agriculture unit of the Brazilian Agricultural Research Corporation (Embrapa), in Campinas, Brazil. His research interests include information retrieval, data science, and digital agriculture.

Pedro Henrique Rodrigues da Cunha da Veiga, IZagro, Franca, SP, Brasil

Pedro Henrique Rodrigues da Cunha da Veiga has a bachelor's degree in Computer Science from the Universidade de Franca (2017) in São Paulo. He is a partner and director of technology at IZagro, an agtech company focused on helping small and medium-sized producers adopt good regenerative practices through simple and clear information. He has experience in web application development, working mainly with back-end technologies for data processing and traffic, using Java (Spring Boot Data), Python, and NodeJS.

Rafael Gomes Caldas, IZagro, Franca, SP, Brasil

Rafael Gomes Caldas has a bachelor's degree in Computer Science from the Universidade de Franca (2017) and has been a developer at IZagro since November 2018. He has experience in web application development, mainly with JavaScript, TypeScript, relational and non-relational databases, and cloud computing technologies.

Wyviane Carlos Lima Vidal, Embrapa Agroenergia, Brasília, DF, Brasil

Wyviane Carlos Lima Vidal has a bachelor's degree in Biological Sciences from the Universidade Federal da Paraíba, obtained in 1997, and a master's degree in Development and Environment from the Universidade Federal da Paraíba, obtained in 2001. She has worked as a researcher at the Brazilian Agricultural Research Corporation (Embrapa) since 2002. She has also worked at Embrapa Tabuleiros Costeiros, Aracaju, SE, from 2002 to 2004; Embrapa Informação Tecnológica, Brasília, DF, as co-editor of the journal Pesquisa Agropecuária Brasileira, from 2005 to 2012; as a book editor for Embrapa, between 2012 and 201; as well as in the Publishing and Production unit of the Communication and Information Management directorate of the General Secretariat of Embrapa, until March 2022. She is currently working at Embrapa Agroenergia, Brasília, DF.

Cristiane Pereira de Assis, Embrapa Sede, Superintendência de Comunicação, Brasília, DF, Brasil

Cristiane Pereira de Assis graduated in Agronomy (2002) and holds master's (2004) and doctoral (2008) degrees in soils and plant nutrition, all from the Universidade Federal de Viçosa. She has experience in agronomy, with an emphasis on soil management and conservation. As part of the Capes-PNPD national post-doctoral program, she completed a post-doctoral fellowship in the Department of Soil at the Universidade Federal do Ceará from 2008 to 2010, with the project Soil Quality in the Jaguaribe-Apodi Irrigated Perimeter, Ceará. She was a professor of Agronomy at the Universidade Federal do Vale do São Francisco from 2010 to 2011. She is currently a researcher at the Brazilian Agricultural Research Corporation (Embrapa), where she served as a scientific editor of the journal Pesquisa Agropecuária Brasileira from April 2012 to January 2018. She now works as an editor of technical-scientific books for Embrapa and is also part of the coordination team for the digital platforms Embrapa Production Systems and the Embrapa Agency for Technological Information (Ageitec).

Jorge Luiz Correa, Embrapa Agricultura Digital, Campinas, SP, Brasil

Jorge Luiz Correa received his bachelor's and master's degrees in Computer Science from the Universidade Estadual Paulista (UNESP), specializing in network and computer systems security. He worked for six years as an analyst and researcher at the ACME Cybersecurity Research Laboratory at UNESP. He also researched network attack detection at the National Institute of Science and Technology-Critical Embedded Systems (INCT-SEC) and was an IT consultant at the São Paulo State Department of Education. He is currently an infrastructure and security analyst at the Brazilian Agricultural Research Corporation (Embrapa), focusing on high-performance and cloud computing.

Maria Fernanda Moura, Embrapa Agricultura Digital, Campinas, SP, Brasil

Maria Fernanda Moura received her bachelor's degree in Statistics from the Universidade Estadual de Campinas (1987), her master's degree in Electrical Engineering from the Universidade Estadual de Campinas (1992), and her Ph.D. in Computer Sciences from the Universidade de São Paulo (2009). She has worked as a researcher at Embrapa Digital Agriculture since August 1989. She has experience in probability and statistics, with an emphasis on data and text mining, working mainly on data science, text mining, experimental statistics, and scientific software development (Python, R, C++, and Java).
