ADVERTISEMENT

Weighted k-Nearest Neighbors Feature Selection (WkNN-FS)

Published by Peter Drotár

Department of Computers and Informatics, Technical University of Kosice, Letná 9, Košice, Slovakia

These findings are described in the article entitled Weighted nearest neighbors feature selection, recently published in the journal Knowledge-Based Systems (Knowledge-Based Systems 163 (2019) 749-761). This work was conducted by Peter Bugata and Peter Drotár from the Technical University of Kosice.

The current digital era has brought unprecedented amounts of data. The ubiquity of various sensors, cameras, a huge amount of interaction on social networks, and the penetration of technologies into biomedicine generate terabytes of data every day. Even though methodologies in data science are advancing and are able to process real-world applications’ big data, there are still some datasets or domains where preprocessing of the data is necessary.

ADVERTISEMENT

Before going deeper into preprocessing, let us explain how data is perceived by data scientists. The dataset can be written as a matrix consisting of rows (samples) and columns (features). The samples represent different measurements, for example, as with data taken from different patients. In this example, the columns are attributes related to one particular patient (e.g. measurements taken during a medical examination or a microarray-based gene expression profiling). Therefore, the matrix contains information about the attributes of multiple patients.

In some domains, the number of attributes is extremely high. In domains such as microarray-based gene expression profiling, this can be tens or even hundreds of thousands of attributes. The high dimensionality of data can have several negative effects on machine learning algorithms. It leads to prolonged computational times and over-fitting (reduced prediction performance due to an inability of the model to generalize). 

This problem is even more emphasized in datasets with a small number of samples. In the case of a small sample of high-dimensional datasets, some preprocessing has to be performed for machine learning algorithms to work well. The frequently employed solution is a reduction of the data dimensionality. Dimensionality reduction can be achieved through the feature selection (FS). The concept is illustrated in Fig. 1. The FS algorithm analyses the data and identifies relevant attributes.  All features evaluated as irrelevant are thrown away so the dimensionality of the dataset is significantly reduced.

Figure 1 Feature selection. Image courtesy Peter Drotár

In a recent publication, Bugata and Drotar proposed a new FS algorithm, WkNN-FS, based on distance and attribute weighted k-nearest neighbors (KNN) with gradient descent as an iterative optimization algorithm for finding the function minima.  KNN is a supervised classification algorithm that determines the value of the target variable according to the values of the nearest data points. In conventional KNN, k nearest neighbors are all equally relevant for prediction. WkNN-FS uses distance and attributes weighted KNN, which means that the  data points closer to predicted data point are given a higher weight. Similarly, the weight of an attribute is determined according to its usefulness.

ADVERTISEMENT

In this approach, the target variable is determined by the most relevant attributes of the most nearest neighbors. Assuming this weighting scheme, the task is to find a weighting vector that determines the importance of the features. Basically, this is an optimization task that is solved using a gradient descent algorithm. Gradient descent is an iterative optimization algorithm that utilizes the gradient of the function to find its minima. Computed weights assigned to features indicate the significance of different features.

Experimental results showed that the method is able to correctly identify significant features and boost prediction performance for high dimensional datasets. Another advantage of WkNN-FS is that its structure allows for computationally efficient TensorFlow implementation and efficient performance on GPUs. The WkNN-Fs is freely available at GITHUB https://github.com/bugatap/WkNN-FS.

Comments

READ THIS NEXT

Microalgae: The Green Mines For Fueling The Future And Mitigating Heavy Metals

The energy crisis, water shortage, and pollution are among the major challenges confronting sustainable environmental existence. Renewable energy derived from […]

Hybrid Polysaccharide Membranes For Dehydration Processes

Solvent dehydration processes have a high economic and environmental relevance in the pharmaceutical, fine-chemistry, and chemical industries. In particular, ethanol […]

Study On How Resilience Affects Stress And Alcohol-Related Behaviors In National Guard Members

Our recent study is the first to show how stressors are related to serious consequences of alcohol use in military […]

How Aging Reduces Neurogenesis In The Hippocampus And Increases The Risk Of Dementia

We all know that as we get older, our minds start to get a little fuzzy. For an increasing number […]

Elevated Temperature And Lower Ocean pH May Limit Larval Supply In The Florida Stone Crab

Many coastal watersheds have suffered from decades of urbanization which has diverted stormwater, carrying nutrient pollution, into coastal ocean waters. […]

What Are Scallops?

Scallops are shellfish found in oceans around the world, they cover several species of saltwater clams and mollusks. The term […]

Asymmetry In Pricing Gasoline Based On Oil Costs Using National Samples

The relationship between oil and gasoline prices has been the subject of investigation over the last three decades, igniting certain […]

Science Trends is a popular source of science news and education around the world. We cover everything from solar power cell technology to climate change to cancer research. We help hundreds of thousands of people every month learn about the world we live in and the latest scientific breakthroughs. Want to know more?