Dr Reza Drikvandi reza.drikvandi@durham.ac.uk
Associate Professor
Sparse principal component analysis for natural language processing
Drikvandi, Reza; Lawal, Olamide
Authors
Olamide Lawal
Abstract
High-dimensional data are growing rapidly in many disciplines, particularly in natural language processing. Natural language processing requires working with high-dimensional matrices of word embeddings obtained from text data. These matrices are often sparse in the sense that they contain many zero elements. Sparse principal component analysis is an advanced mathematical tool for the analysis of high-dimensional data. In this paper, we study and apply sparse principal component analysis for natural language processing, which can effectively handle large sparse matrices. We study several formulations of sparse principal component analysis, together with algorithms for implementing those formulations. Our work is motivated and illustrated by a real text dataset. We find that sparse principal component analysis performs as well as ordinary principal component analysis in terms of accuracy and precision, while showing two major advantages: faster calculations and easier interpretation of the principal components. These advantages are especially helpful in big data situations.
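To make the interpretability point concrete, the following is a minimal sketch, not one of the formulations studied in the paper. It uses a soft-thresholded power iteration, one simple heuristic for a sparse leading component, and compares it with the ordinary leading principal component. The toy document-term counts, the term labels, the penalty `lam = 1.0`, and all function names are assumptions chosen for illustration only.

```python
import math

# Hypothetical toy document-term count matrix (rows: documents, columns: terms).
terms = ["model", "data", "the", "of", "and"]
X = [
    [4, 3, 1, 0, 1],
    [5, 4, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [5, 5, 1, 1, 0],
    [0, 0, 1, 0, 1],
]


def covariance(X):
    """Sample covariance matrix of the columns of X."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    return [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(p)]
            for a in range(p)]


def soft_threshold(x, lam):
    """Shrink x towards zero by lam; values within [-lam, lam] become 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def leading_component(S, lam=0.0, iters=200):
    """Power iteration on S with soft-thresholding at each step.

    lam = 0 gives the ordinary leading principal component;
    lam > 0 pushes small loadings to exactly zero (an illustrative heuristic).
    """
    p = len(S)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [soft_threshold(sum(S[a][b] * v[b] for b in range(p)), lam)
             for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # penalty too large: every loading was removed
            return w
        v = [x / norm for x in w]
    return v


S = covariance(X)
dense = leading_component(S, lam=0.0)   # ordinary PCA loading vector
sparse = leading_component(S, lam=1.0)  # sparse loading vector (assumed penalty)

for term, d, s in zip(terms, dense, sparse):
    print(f"{term:>6}  dense={d:+.3f}  sparse={s:+.3f}")
```

With these toy counts and this penalty, the thresholded loading vector is non-zero only for the two strongly co-varying terms, whereas the ordinary component spreads weight over every term. The sketch illustrates the interpretation advantage only; it makes no claim about the computational speed reported in the abstract.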
Citation
Drikvandi, R., & Lawal, O. (2023). Sparse principal component analysis for natural language processing. Annals of Data Science, 10(1), 25-41. https://doi.org/10.1007/s40745-020-00277-x
Field | Value |
---|---|
Journal Article Type | Article |
Acceptance Date | Apr 30, 2020 |
Online Publication Date | May 18, 2020 |
Publication Date | 2023-02 |
Deposit Date | Oct 6, 2020 |
Publicly Available Date | Jan 25, 2023 |
Journal | Annals of Data Science |
Print ISSN | 2198-5804 |
Electronic ISSN | 2198-5812 |
Publisher | Springer |
Peer Reviewed | Peer Reviewed |
Volume | 10 |
Issue | 1 |
Pages | 25-41 |
DOI | https://doi.org/10.1007/s40745-020-00277-x |
Files
Published Journal Article (PDF, 893 KB)
Publisher Licence URL
http://creativecommons.org/licenses/by/4.0/
Copyright Statement
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.