Hello folks,

To clarify the concept of clustering, a recurring theme in machine learning, we made a video tutorial that demonstrates a clustering problem we can solve visually, then finishes with a real case and some conclusions. It is worth mentioning that other areas can benefit from this technique, for example market segmentation, where you can reach different audiences according to their characteristics.

Below is the description of the video for those who like reading.

To make the concept easier to absorb, we will use a visual example. Imagine that you have a textile factory and you want to produce as many flags as possible in the shortest time and with as little material as possible. Considering that there are around 200 national flags, each with different colors and shapes, we want to know which color patterns and shapes exist so that we can optimize and organize the production line. That’s the idea: reduce costs and time while maintaining quality and volume.

All flags

Figure 1 – Representation of raw data without patterns detected

A good clustering algorithm should be able to identify patterns in the raw data, just as we humans can visually spot the similarity between the Italian, Irish and Mexican flags in the example below. One factor that differentiates clustering algorithms from classification algorithms is that they receive no hints about the patterns to look for: they must figure out the groups automatically, and this is a big challenge for practitioners.


Figure 2: Cluster zero (0) composed of the Italian, Irish and the Mexican flags.
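
To make the idea of grouping without hints concrete, here is a minimal sketch of k-means, one common clustering algorithm (not necessarily the one used in the video). The feature encoding (vertical bands, horizontal bands, has emblem), the choice of flags, and the function name kmeans are assumptions made for this illustration only.

```python
from math import dist

# Hand-coded, illustrative features (an assumption for this sketch):
# (vertical bands, horizontal bands, has emblem)
flags = {
    "Italy": (3, 0, 0),
    "Ireland": (3, 0, 0),
    "Mexico": (3, 0, 1),
    "Germany": (0, 3, 0),
    "Netherlands": (0, 3, 0),
    "Russia": (0, 3, 0),
}

def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point that is farthest from every centroid chosen so far.
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(centroids[i])
        if updated == centroids:
            break
        centroids = updated
    return labels

# No labels are given to the algorithm, only the features and the number of groups.
labels = kmeans(list(flags.values()), k=2)
for name, label in zip(flags, labels):
    print(f"{name}: cluster {label}")
```

In this sketch the three vertical-band flags end up in one cluster and the three horizontal-band flags in the other, even though the algorithm was never told which flags belong together.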

In this context, finding individuals that do not resemble any other element is as important as identifying groups of similar elements. These exceptions are the so-called outliers.


Figure 3: Cluster six (6) composed of the flag of Nepal. An exception.
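
One simple way to spot an exception like this, sketched below, is to flag any item whose nearest neighbor is farther away than some threshold. The features (vertical bands, horizontal bands, is rectangular), the threshold of 2.0, and the function name find_outliers are all assumptions for illustration; real outlier detection usually needs more careful tuning.

```python
from math import dist

# Hand-coded, illustrative features (an assumption for this sketch):
# (vertical bands, horizontal bands, is rectangular)
flags = {
    "Italy": (3, 0, 1),
    "Ireland": (3, 0, 1),
    "Mexico": (3, 0, 1),
    "Germany": (0, 3, 1),
    "Netherlands": (0, 3, 1),
    "Nepal": (0, 0, 0),
}

def find_outliers(items, threshold):
    """Return the names whose nearest neighbor is farther than `threshold`."""
    outliers = []
    for name, features in items.items():
        nearest = min(
            dist(features, other)
            for other_name, other in items.items()
            if other_name != name
        )
        if nearest > threshold:
            outliers.append(name)
    return outliers

# The threshold is a hypothetical value chosen for this toy data.
outliers = find_outliers(flags, threshold=2.0)
print(outliers)
```

With this toy encoding, only the flag of Nepal has no close neighbor, so it is the only one reported as an outlier.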

Finally, as the result of a good clustering process, we have groups formed by flags with similar features, plus isolated individuals, which are the outliers.


Figure 4: Clusters formed at the end of visual human-based processing.

One of the most important factors in clustering is the number of groups into which the elements will be allocated. In many cases, we have observed very different results when applying the same data and the same parameterization to different algorithms. This is very important. See below what the result of an inaccurate clustering could look like.


Figure 5: Result of an incorrect clustering.
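
The influence of the number of groups can be reproduced with a small sketch. It runs a plain k-means (one common clustering algorithm, chosen here only for illustration) on the same toy data with two and then three groups. The feature encoding (vertical bands, horizontal bands), the selected flags, and the function name kmeans are assumptions of this example.

```python
from math import dist

# Hand-coded, illustrative features (an assumption for this sketch):
# (vertical bands, horizontal bands)
flags = {
    "Italy": (3, 0),
    "Ireland": (3, 0),
    "Germany": (0, 3),
    "Netherlands": (0, 3),
    "Poland": (0, 2),
    "Ukraine": (0, 2),
}

def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
        # Move every centroid to the mean of its members.
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(centroids[i])
        if updated == centroids:
            break
        centroids = updated
    return labels

points = list(flags.values())
two_groups = kmeans(points, k=2)    # same data, two groups requested
three_groups = kmeans(points, k=3)  # same data, three groups requested

for name, a, b in zip(flags, two_groups, three_groups):
    print(f"{name}: k=2 -> cluster {a}, k=3 -> cluster {b}")
```

With two groups, the two-band and three-band horizontal flags are merged into a single cluster; with three groups they are separated. Whether the merge is acceptable depends on the production line, which is why the number of groups deserves attention.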

So, a practical question is:

Would you invest your money in this?

Probably not, and solving this problem is our challenge. A real application that we carried out was to identify the main characteristics of patients who don’t show up for their medical appointments, the well-known no-show problem, which has deep implications for offices, clinics, and hospitals. The result was a remarkable group containing 50% of the analyzed data, which really deserves a specific policy. Doesn’t this prove the chief financial officers of these organizations right?

Other possible applications of the clustering strategy were presented in this post “14 sectors for application of Big Data and data necessary for analysis.”

Some conclusions

  • Our vision is very powerful at clustering images, as in the case of the flags.
  • It is humanly impossible to analyze and logically correlate the numbers in a large database, which is why clustering algorithms were created.
  • The accuracy of the results of clustering is crucial for making investment decisions.
  • Several sectors can benefit from this management approach.

Thank you!

VORTX Big Data

Aquarela developed VORTX Big Data to make predictive analytics a lot easier, more precise and more robust than current solutions on the market, with significant impact on business problems such as churn reduction, business scenario discovery, predictive maintenance, market segmentation and healthcare resource optimization.