
May dd, 2022
Shoichiro Yokotani, Application Development Expert
AI Platform division

Frovedis Machine Learning: Unsupervised Learning Dimensionality reduction using clustering and t-SNE Reduced learning time (compared to scikit-learn)

Unsupervised learning is a general term for learning that extracts information from a data set that has no indication of the correct answers. In supervised learning, each input has a corresponding output (correct answer), which makes it possible to verify the correctness of the learning results. In unsupervised learning, however, there is no such measure of correctness, and it is generally difficult to judge whether the learning results are appropriate.

Unsupervised learning can be divided into two main categories. The first is clustering, which divides a data set into groups according to the characteristics of the data. For example, it can be used to group news articles into politics, economics, sports, and so on, based on the similarity of the words in the individual articles. Unsupervised learning does not assign a name to each group automatically, so each group must be labeled by human judgment, and the validity of the resulting groups also has to be assessed by a person.
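As a minimal sketch of this idea, the snippet below clusters article vectors with scikit-learn's KMeans (Frovedis provides the same interface). The input file name, the choice of four clusters, and the use of averaged word embeddings as article vectors are all illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical input: one feature vector per news article, e.g. an averaged
# word embedding, with shape (n_articles, n_features).
article_vectors = np.load("article_vectors.npy")

# Group the articles into four clusters; the number of groups is a choice.
km = KMeans(n_clusters=4, random_state=0)
cluster_ids = km.fit_predict(article_vectors)

# k-means only returns group IDs (0..3). Mapping them to names such as
# "politics", "economy", or "sports" is left to human judgment after reading
# sample articles from each group.
for cid in range(4):
    print("cluster", cid, ":", int(np.sum(cluster_ids == cid)), "articles")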

The second is dimensionality reduction, which converts a data set with high-dimensional features into a lower-dimensional one. High-dimensional data is difficult to understand by graphing its features directly. In such cases, dimensionality reduction helps, for example by creating a small number of important variables from the high-dimensional features, or by transforming the data into a low-dimensional space based on the distances between points in the high-dimensional space. This makes it easier to grasp the features of the data visually.
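The following sketch illustrates both approaches with scikit-learn: PCA builds a few important variables, and t-SNE maps the data to two dimensions based on neighborhood distances. The input file name and parameter values are assumptions for illustration.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical high-dimensional data, shape (n_samples, n_features).
X = np.load("high_dim_features.npy")

# Option 1: create a few important variables as linear combinations (PCA).
X_pca = PCA(n_components=2).fit_transform(X)

# Option 2: map the data to 2-D while preserving neighborhood structure (t-SNE).
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

# Either 2-D embedding can now be scatter-plotted for visual inspection.
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=5)
plt.show()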

As a sample for the unsupervised learning algorithms presented here, we use a data set of news articles that has been segmented into words and vectorized with Word2vec in advance. As a first step, we group the words in the news articles with k-means clustering. Next, we visualize the data with t-SNE and cluster it again with the DBSCAN algorithm. For each learning algorithm, we measure the training time with both Frovedis and scikit-learn.
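A minimal sketch of these steps is shown below using scikit-learn class names; Frovedis is designed to offer the same interface, so the same code can be redirected to the Frovedis server by switching the imports. The input file names, cluster counts, and DBSCAN parameters are illustrative, and the Frovedis module paths in the comments are assumptions to be checked against the Frovedis documentation.

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.manifold import TSNE

# Hypothetical input: Word2vec vectors of the words that appear in the news
# articles, shape (n_words, embedding_dim), and the corresponding word strings.
word_vectors = np.load("word2vec_vectors.npy")
words = np.load("words.npy", allow_pickle=True)

# Step 1: group the words into eight clusters with k-means.
kmeans_labels = KMeans(n_clusters=8, random_state=0).fit_predict(word_vectors)

# Step 2: reduce the vectors to 2-D with t-SNE for visualization.
embedding_2d = TSNE(n_components=2, random_state=0).fit_transform(word_vectors)

# Step 3: cluster again with DBSCAN, which needs no preset cluster count
# (eps and min_samples are illustrative values).
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(word_vectors)

# To run the same steps on SX-Aurora TSUBASA, the Frovedis server is started
# first and the imports are switched to the Frovedis wrappers, which follow
# the scikit-learn interface (module paths below are assumptions):
#   from frovedis.exrpc.server import FrovedisServer
#   FrovedisServer.initialize("mpirun -np 8 " + os.environ["FROVEDIS_SERVER"])
#   from frovedis.mllib.cluster import KMeans, DBSCAN
#   from frovedis.mllib.manifold import TSNE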


[Figure: clustering_for_column (word groups produced by k-means and DBSCAN)]

The word groups formed by k-means fall into eight coherent groups that appear to relate to the world economy, trade, financial markets, and corporate information. DBSCAN, on the other hand, produces two large groups of words that could appear in almost any article. The contents of the groups clearly vary greatly depending on the clustering algorithm.
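To see what each algorithm actually grouped together, the member words of each cluster can be listed, continuing the sketch above (words, kmeans_labels, and dbscan_labels are the hypothetical names introduced there).

import numpy as np

# Print a few member words of each group so a human can judge what the
# clusters represent.
for cid in np.unique(kmeans_labels):
    print("k-means cluster", cid, ":", list(words[kmeans_labels == cid][:10]))

# DBSCAN labels noise points (members of no cluster) with -1.
for cid in np.unique(dbscan_labels):
    print("DBSCAN cluster", cid, ":", list(words[dbscan_labels == cid][:10]))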

The following table summarizes the time required for each learning process. The difference is particularly large for t-SNE, even though Frovedis uses method="exact", which computes the result exactly (more computation and a longer run time), while scikit-learn uses method="barnes_hut", which reduces the amount of computation by approximation. At this data size, scikit-learn took about 5 hours without the approximation option, which is not practical. A minimal timing sketch follows the table.

Learning algorithm   Frovedis (sec)   scikit-learn (sec)   Ratio
t-SNE                18.73            198.75               x10.6
k-means              0.08             0.68                 x8.5
DBSCAN               0.20             5.83                 x29.2
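The sketch below shows how such a t-SNE timing can be taken with scikit-learn and the Barnes-Hut approximation; the data and parameters are illustrative, and the Frovedis TSNE class is used in the same way once the server is running (with the import path noted earlier as an assumption).

import time
from sklearn.manifold import TSNE

def timed_tsne(X, method):
    # Return the wall-clock time of one fit_transform with the given method.
    start = time.time()
    TSNE(n_components=2, method=method, random_state=0).fit_transform(X)
    return time.time() - start

# scikit-learn with the Barnes-Hut approximation, as in the table above;
# method="exact" in scikit-learn would take hours at this data size.
print("barnes_hut:", timed_tsne(word_vectors, "barnes_hut"), "sec")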

Depending on the shape of the data distribution, a given clustering algorithm may or may not be suitable. Determining which algorithm best fits a particular analysis often requires repeatedly training with many combinations of algorithms and parameters, which is time-consuming. With SX-Aurora TSUBASA and Frovedis, this burden of machine learning can be reduced.