Projects
2025
Generating interesting high-dimensional data structures
A high-dimensional dataset is one in which each observation is described by many features, or dimensions. Such a dataset might contain various types of structures with complex geometric properties, such as nonlinear manifolds, clusters, or sparse distributions. We can generate data containing a variety of structures using mathematical functions and statistical distributions: sampling from a multivariate normal distribution produces an elliptical cloud, a trigonometric function generates a spiral, and a torus function creates a donut shape. High-dimensional data structures are useful for testing, validating, and improving algorithms used in dimensionality reduction, clustering, machine learning, and visualization. Their controlled complexity allows researchers to understand the challenges posed in data analysis and helps develop robust analytical methods across diverse scientific fields such as bioinformatics, machine learning, and forensic science. Functions to generate a large variety of structures in high dimensions are organized into the R package cardinalR, along with some already generated examples.
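To illustrate the kinds of generators described above, here is a minimal NumPy sketch (not cardinalR's actual API; the function names are hypothetical) that samples an elliptical Gaussian cluster, a spiral, and a torus, and then pads a low-dimensional structure with noise coordinates to embed it in a higher-dimensional space:

```python
import numpy as np

rng = np.random.default_rng(42)

def gaussian_cluster(n, dim, scale=1.0):
    """Elliptical cluster: sample from a multivariate normal."""
    cov = np.diag(rng.uniform(0.5, 2.0, dim)) * scale
    return rng.multivariate_normal(np.zeros(dim), cov, size=n)

def spiral_3d(n, turns=3):
    """Spiral built from trigonometric functions, lifted into 3-D."""
    t = np.linspace(0, turns * 2 * np.pi, n)
    return np.column_stack([t * np.cos(t), t * np.sin(t), t])

def torus(n, R=2.0, r=0.5):
    """Donut shape: torus parameterised by two angles."""
    u = rng.uniform(0, 2 * np.pi, n)
    v = rng.uniform(0, 2 * np.pi, n)
    x = (R + r * np.cos(v)) * np.cos(u)
    y = (R + r * np.cos(v)) * np.sin(u)
    z = r * np.sin(v)
    return np.column_stack([x, y, z])

def embed(points, dim, noise=0.05):
    """Pad a low-dimensional structure with small-noise dimensions."""
    n, d = points.shape
    extra = rng.normal(0, noise, size=(n, dim - d))
    return np.hstack([points, extra])

data = embed(torus(500), dim=10)  # a 10-D dataset containing a torus
```

The `embed` step mimics how such benchmark datasets hide low-dimensional geometry among noise dimensions, which is what makes them useful stress tests for dimensionality reduction and clustering methods.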
2024
Visualise non-linear dimension reductions as models in a high-dimensional data space
Non-linear dimension reduction (NLDR) techniques such as t-SNE and UMAP provide a low-dimensional representation of high-dimensional data by applying a non-linear transformation. NLDR often exaggerates random patterns, sometimes due to the particular samples observed. However, NLDR views have an important role in data analysis because, if done well, they provide a concise visual (and conceptual) summary of high-dimensional distributions. NLDR methods and their (hyper)parameter choices can create wildly different representations, making it difficult to decide which is best, or whether any or all are accurate or misleading. To help assess an NLDR layout and decide which, if any, is the most reasonable representation of the structure(s) present in the high-dimensional data, we have developed an algorithm to show the 2D NLDR model in the high-dimensional space, viewed with a tour. The algorithm hexagonally bins the data in the NLDR view and triangulates the bin centroids to create a wireframe. The high-dimensional representation is generated by mapping each centroid to a high-dimensional point, obtained by averaging the values of the observations belonging to its bin. Lines connecting centroids in the NLDR view also connect the corresponding high-dimensional points. The resulting model representation is overlaid on the high-dimensional data and viewed using a tour, a movie of linear projections. From this, one can see whether the model fits everywhere, fits better in some subspaces, or completely mismatches the data. We can also see how different methods may share similar summaries or quirks. The process is what Wickham et al. (2015) coined viewing the “model in the data space,” while the NLDR view would be considered viewing the “data in the (NLDR) model space”. This work will interest analysts seeking to understand the complexity of high-dimensional data, especially in bioinformatics and ecology, where NLDR is prevalent. The algorithm is available via the R package quollr.
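The bin-lift-and-wire idea can be sketched in a few lines of NumPy. This is an illustrative simplification, not quollr's implementation: square bins and a k-nearest-neighbour wireframe stand in for the hexagonal binning and centroid triangulation the actual algorithm performs, and the function name is hypothetical.

```python
import numpy as np

def lift_embedding(embedding_2d, data_hd, bins=8, k=3):
    """Bin a 2-D NLDR layout, lift each occupied bin's centroid into
    high dimensions by averaging its member observations, and connect
    centroids into a wireframe that can be overlaid on the data."""
    # assign each point to a square bin (quollr uses hexagonal bins)
    lo, hi = embedding_2d.min(axis=0), embedding_2d.max(axis=0)
    cell = ((embedding_2d - lo) / (hi - lo + 1e-12) * bins).astype(int)
    cell = np.clip(cell, 0, bins - 1)
    keys = cell[:, 0] * bins + cell[:, 1]

    centroids_2d, centroids_hd = [], []
    for key in np.unique(keys):
        members = keys == key
        centroids_2d.append(embedding_2d[members].mean(axis=0))
        centroids_hd.append(data_hd[members].mean(axis=0))
    centroids_2d = np.array(centroids_2d)
    centroids_hd = np.array(centroids_hd)

    # connect each 2-D centroid to its k nearest neighbours; the same
    # edges join the lifted high-dimensional centroids into a wireframe
    d = np.linalg.norm(centroids_2d[:, None] - centroids_2d[None], axis=-1)
    np.fill_diagonal(d, np.inf)
    edges = {tuple(sorted((i, int(j))))
             for i in range(len(d))
             for j in np.argsort(d[i])[:k]}
    return centroids_hd, sorted(edges)
```

The returned high-dimensional centroids and edge list are what would be overlaid on the original data and inspected with a tour.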
2022
Computer Vision System For Automated Medicinal Plant Identification
Medicinal plants are usually identified by practitioners based on years of experience, using visual or olfactory cues. The other method of recognizing these plants involves laboratory-based testing, which requires trained skills and data interpretation, and is costly and time-intensive. Automatic ways to identify medicinal plants are useful, especially for those lacking experience in herbal recognition, and there is no standard mechanism for identifying medicinal plants. Therefore, we introduce an automatic approach based on statistical machine learning to identify medicinal plants. We refer to our medicinal plant classification algorithm as MEDIPI: MEDIcinal Plant Identification. The classification algorithm operates on features extracted from leaf images.
2021
Computer-aided Interpretable Features for Leaf Image Classification
Plant species identification is time consuming, costly, and requires considerable effort and expert knowledge. Recently, many researchers have used deep learning methods to classify plants directly from plant images. While deep learning models have achieved great success, their lack of interpretability limits their widespread application. To overcome this, we explore the use of interpretable, measurable, computer-aided features extracted from plant leaf images. Image processing is one of the most challenging and crucial steps in feature extraction; its purpose is to improve the leaf image by removing undesired distortion. The main image processing steps of our algorithm are: i) converting the original image to an RGB (Red-Green-Blue) image, ii) gray scaling, iii) Gaussian smoothing, iv) binary thresholding, v) removing the stalk, vi) closing holes, and vii) resizing the image. The next step after image processing is to extract features from the plant leaf images. We introduce 52 computationally efficient features to classify plant species, grouped into four categories: i) shape-based features, ii) color-based features, iii) texture-based features, and iv) scagnostic features. Length, width, area, texture correlation, monotonicity, and scagnostics are a few examples. We explore the ability of the features to discriminate the classes of interest under both supervised and unsupervised learning settings. For that, a supervised dimensionality reduction technique, Linear Discriminant Analysis (LDA), and an unsupervised dimensionality reduction technique, Principal Component Analysis (PCA), are used to convert and visualize the images from digital-image space to feature space. The results show that the features are sufficient to discriminate the classes of interest under both supervised and unsupervised learning settings.
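A few of the pipeline steps and features above can be sketched with plain NumPy. This is a hypothetical illustration, not the paper's implementation: a mean-intensity threshold stands in for proper binary thresholding (Otsu's method in practice), the PCA projection is computed via SVD, and all function names are invented here.

```python
import numpy as np

def grayscale(rgb):
    """Step ii: luminance-weighted grayscale conversion."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def gaussian_smooth(img, sigma=1.0):
    """Step iii: separable Gaussian blur via 1-D convolutions."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    img = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, img)

def binarise(img):
    """Step iv: threshold at the mean intensity (Otsu in practice)."""
    return (img > img.mean()).astype(np.uint8)

def shape_features(mask):
    """A few of the shape-based features: area, length, width."""
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    length, width = int(rows.sum()), int(cols.sum())
    return {"area": int(mask.sum()), "length": length, "width": width,
            "aspect_ratio": float(length / max(width, 1))}

def pca_2d(X):
    """Project a feature matrix to 2-D for visualisation (PCA via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

leaf = np.random.default_rng(1).random((64, 64, 3))  # stand-in leaf image
feats = shape_features(binarise(gaussian_smooth(grayscale(leaf))))
```

Once such features are computed per image, stacking them into a matrix and passing it to `pca_2d` (or to LDA, using class labels) gives the feature-space view used to check class separation.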