Vector Databases and Embedding Models: Comparative evaluation of performance in semantic retrieval in Portuguese
DOI:
https://doi.org/10.33448/rsd-v14i10.49768
Keywords:
Vector Databases, Embedding Models, Semantic Retrieval, Performance Evaluation, Portuguese Language.
Abstract
The growing use of large language models has intensified the demand for vector databases capable of handling high-dimensional semantic representations. This study comparatively evaluates different combinations of vector databases and multilingual embedding models with respect to their applicability to semantic retrieval in Portuguese. The research is experimental and applied, was conducted in a local environment, and is structured in four stages: database construction, definition of selection criteria, implementation of an experimentation pipeline, and evaluation of relevance, diversity, and efficiency. Classic information retrieval metrics (Recall@k and nDCG) were analyzed, together with diversity and balance metrics (α-nDCG and ILD) and computational efficiency indicators (average latency, p95 latency, average CPU usage, RAM usage, and queries per second, QPS). The results showed that Milvus and Weaviate stand out in scenarios with higher computational demand, while pgvector proved more memory-efficient. Chroma and pgvector demonstrated viability in smaller-scale contexts. Among the embedding models, the multilingual models available on Hugging Face performed consistently on tasks in Portuguese. As a contribution, this work presents a systematic empirical analysis that highlights the potential and limitations of different vector database/embedding model combinations, supporting practical decisions in digital curation projects, data observatories, and recommendation systems in Portuguese.
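The abstract names its metrics without defining them, and the authors' implementation is not shown on this page. As a rough illustration only, the sketch below computes Recall@k, nDCG@k, ILD, p95 latency, and QPS from their common textbook definitions. All function names are hypothetical, ILD is assumed to be the mean pairwise cosine distance among retrieved items, p95 is assumed to use the nearest-rank method, and α-nDCG is omitted because it requires per-query subtopic labels.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked_ids, gains, k):
    """nDCG@k with graded gains and a log2 rank discount.

    `gains` maps document id -> relevance grade; unlisted ids count as 0.
    """
    dcg = sum(gains.get(doc, 0) / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 for zero vectors (an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def intra_list_diversity(vectors):
    """ILD: mean pairwise cosine distance among the retrieved embeddings."""
    n = len(vectors)
    if n < 2:
        return 0.0
    total = sum(cosine_distance(vectors[i], vectors[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)

def p95_latency(latencies):
    """95th percentile by the nearest-rank method (assumed convention)."""
    if not latencies:
        raise ValueError("no latency samples")
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

def queries_per_second(num_queries, elapsed_seconds):
    """Throughput over a measured wall-clock interval."""
    return num_queries / elapsed_seconds
```

For example, `recall_at_k(["a", "b", "c"], {"a", "c", "d"}, 2)` counts one of three relevant documents in the top two results, and `intra_list_diversity([[1.0, 0.0], [0.0, 1.0]])` is the cosine distance between two orthogonal vectors.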
License
Copyright (c) 2025 Patrick Fernandes Rezende Ribeiro, Juliane de Lima Pires, Patrick Alves Bastos, Roberto Rigo, Henrique Assumpção dos Reis, Kamilly Voitkiv Hubner, Maria Fernanda Zandoná Casagrande, Bruno de Paula Marafiga, Dante Krol Simba, Denise Fukumi Tsunoda

This work is licensed under a Creative Commons Attribution 4.0 International License.
