Abstract
In the analysis of high-dimensional single-cell sequencing data, accurately measuring the similarity between cells is pivotal. However, conventional metrics such as Euclidean distance after L1-normalization can lose distinguishing information in high dimensions, where the distances between different observations converge to a shrinking interval. In this article, we use distance entropy to quantify the amount of information contained in the distances, and we discuss the influence of normalization by different p-norms and the shortcomings of Euclidean distance. We find that differences between observations are better preserved when the data are normalized by a higher p-norm and geodesic distance, rather than Euclidean distance, is used as the similarity measure. We further find that L2-normalization onto the hypersphere is often sufficient to preserve subtle differences even in relatively high-dimensional data while maintaining computational efficiency. We then present hypersphere t-distributed stochastic neighbor embedding (HS-SNE), an augmentation of t-SNE based on this hypersphere representation system, which addresses the difficulty of visualizing and measuring similarity in high-dimensional data. Our results on multiple single-cell sequencing datasets show that the hypersphere representation system resolves more subtle differences between high-dimensional data points while balancing distance entropy preservation and computational efficiency.
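The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define distance entropy, so the sketch assumes it is the Shannon entropy of a histogram of pairwise distances, and the function names (`lp_normalize`, `pairwise_geodesic`, `distance_entropy`) and the bin count are hypothetical choices made for illustration only.

```python
import numpy as np


def lp_normalize(X, p=2.0):
    """Scale each row of X to unit p-norm (p=1 gives L1, p=2 gives L2)."""
    norms = np.linalg.norm(X, ord=p, axis=1, keepdims=True)
    # Leave all-zero rows unchanged instead of dividing by zero.
    return X / np.where(norms == 0, 1.0, norms)


def pairwise_euclidean(X):
    """Euclidean distance between every pair of rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.clip(d2, 0.0, None))


def pairwise_geodesic(X):
    """Great-circle (angular) distance after projecting rows onto the unit hypersphere."""
    U = lp_normalize(X, p=2.0)
    cos = np.clip(U @ U.T, -1.0, 1.0)
    return np.arccos(cos)


def distance_entropy(D, bins=50):
    """Shannon entropy (bits) of the histogram of off-diagonal distances.

    Assumption: this histogram-based definition is an illustrative stand-in,
    not necessarily the definition used in the article.
    """
    d = D[np.triu_indices_from(D, k=1)]
    counts, _ = np.histogram(d, bins=bins, range=(0.0, float(d.max())))
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))
```

Under these assumptions, one could compare `distance_entropy(pairwise_euclidean(lp_normalize(X, p=1)))` with `distance_entropy(pairwise_geodesic(X))` on a cells-by-genes matrix `X`. A higher value would indicate that the pairwise distances are spread over more of the available range rather than concentrated in a narrow interval.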