ADVERTISEMENT

Weighted k-Nearest Neighbors Feature Selection (WkNN-FS)

Published by Peter Drotár

Department of Computers and Informatics, Technical University of Kosice, Letná 9, Košice, Slovakia

These findings are described in the article entitled Weighted nearest neighbors feature selection, recently published in the journal Knowledge-Based Systems (Knowledge-Based Systems 163 (2019) 749-761). This work was conducted by Peter Bugata and Peter Drotár from the Technical University of Kosice.

The current digital era has brought unprecedented amounts of data. The ubiquity of various sensors, cameras, a huge amount of interaction on social networks, and the penetration of technologies into biomedicine generate terabytes of data every day. Even though methodologies in data science are advancing and are able to process real-world applications’ big data, there are still some datasets or domains where preprocessing of the data is necessary.

ADVERTISEMENT

Before going deeper into preprocessing, let us explain how data is perceived by data scientists. The dataset can be written as a matrix consisting of rows (samples) and columns (features). The samples represent different measurements, for example, as with data taken from different patients. In this example, the columns are attributes related to one particular patient (e.g. measurements taken during a medical examination or a microarray-based gene expression profiling). Therefore, the matrix contains information about the attributes of multiple patients.

In some domains, the number of attributes is extremely high. In domains such as microarray-based gene expression profiling, this can be tens or even hundreds of thousands of attributes. The high dimensionality of data can have several negative effects on machine learning algorithms. It leads to prolonged computational times and over-fitting (reduced prediction performance due to an inability of the model to generalize). 

This problem is even more emphasized in datasets with a small number of samples. In the case of a small sample of high-dimensional datasets, some preprocessing has to be performed for machine learning algorithms to work well. The frequently employed solution is a reduction of the data dimensionality. Dimensionality reduction can be achieved through the feature selection (FS). The concept is illustrated in Fig. 1. The FS algorithm analyses the data and identifies relevant attributes.  All features evaluated as irrelevant are thrown away so the dimensionality of the dataset is significantly reduced.

Figure 1 Feature selection. Image courtesy Peter Drotár

In a recent publication, Bugata and Drotar proposed a new FS algorithm, WkNN-FS, based on distance and attribute weighted k-nearest neighbors (KNN) with gradient descent as an iterative optimization algorithm for finding the function minima.  KNN is a supervised classification algorithm that determines the value of the target variable according to the values of the nearest data points. In conventional KNN, k nearest neighbors are all equally relevant for prediction. WkNN-FS uses distance and attributes weighted KNN, which means that the  data points closer to predicted data point are given a higher weight. Similarly, the weight of an attribute is determined according to its usefulness.

ADVERTISEMENT

In this approach, the target variable is determined by the most relevant attributes of the most nearest neighbors. Assuming this weighting scheme, the task is to find a weighting vector that determines the importance of the features. Basically, this is an optimization task that is solved using a gradient descent algorithm. Gradient descent is an iterative optimization algorithm that utilizes the gradient of the function to find its minima. Computed weights assigned to features indicate the significance of different features.

Experimental results showed that the method is able to correctly identify significant features and boost prediction performance for high dimensional datasets. Another advantage of WkNN-FS is that its structure allows for computationally efficient TensorFlow implementation and efficient performance on GPUs. The WkNN-Fs is freely available at GITHUB https://github.com/bugatap/WkNN-FS.

Comments

READ THIS NEXT

Algal Process Affords Sustainable Wastewater Treatment And Disinfection

Publicly owned treatment works (POTWs) are required to reduce the pollutants in urban wastewaters to statutory levels prior to discharging […]

What Is A Carpet Beetle (Bug)?

The carpet beetle is a colloquial name given to a number of species of beetle classified in the Desmistidae family. Found all […]

Integrated Pollination Management

Today we separate the two most common ways of managing pollination services in crops. One is the use of managed […]

Red And Processed Meat Are The Risk Factors With The Highest Impact In The Prevention Of Colorectal Cancer

Research from the Danish Cancer Society shows that the intake of red and processed meat are the risk factors with […]

Climate Change Impacts On Fisheries And Biodiversity Of Pacific Island Countries And Territories

People often think of Pacific Island Countries and Territories as a tropical paradise teeming with coral reefs and abundant and […]

Tumors In A Dish: Tools That Teach Us How To Improve Cancer Treatments

The growth of new blood vessels from existing blood vessels – angiogenesis – has been a prominent target of cancer […]

Yo-Yo Diets Don’t Work

Published by Jack Pryor and Stephanie Simonds Department of Physiology, Monash Biomedical Discovery Institute, Monash University, Australia These findings are […]

Science Trends is a popular source of science news and education around the world. We cover everything from solar power cell technology to climate change to cancer research. We help hundreds of thousands of people every month learn about the world we live in and the latest scientific breakthroughs. Want to know more?