To demonstrate how VORTX works, I selected a well-known dataset with information about the passengers who embarked on the Titanic. Despite the tragic event, this dataset is fairly rich in details and has been widely used in Machine Learning communities, since it allows the application of several Big Data techniques.
In this case, I am going to apply VORTX, a Big Data tool focused on automatic segmentation plus other important decision-making indicators. This technique is called clustering. More information about it can be found in this post (How can big data clustering strategy help business). In the conclusion section, I give some ideas on how this innovative approach can help businesses.
Titanic Dataset summary
According to Encyclopedia Titanica, “On 10 April 1912, the new liner sailed from Southampton, England with 2,208 passengers and crew, but four days later she collided with an iceberg and sank: 1496 people died and 712 survived”. The data we had access to for this analysis gave the following figures:
- 1309 people on board of which 500 survived (38%) and 809 (62%) died.
- An average age of 29.88 years (estimated).
- 466 women of which 127 died and 339 survived.
- 843 men, of which 682 died and 161 survived.
- The average ticket cost was £53.65 per woman and £76.60 per man.
For more details on the complete dataset, search online for "Titanic Dataset".
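As a quick sanity check, the counts listed above can be cross-checked and turned into survival rates with a few lines of Python. This is only an arithmetic sketch on the figures quoted in this post; the variable names are mine and nothing here comes from VORTX.

```python
# Counts copied from the summary above.
total, survived, died = 1309, 500, 809
women = {"died": 127, "survived": 339}
men = {"died": 682, "survived": 161}

def rate(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

women_total = sum(women.values())
men_total = sum(men.values())

# The gender breakdown should add up to the overall figures.
assert women_total + men_total == total
assert women["survived"] + men["survived"] == survived
assert women["died"] + men["died"] == died

overall_rate = rate(survived, total)
women_rate = rate(women["survived"], women_total)
men_rate = rate(men["survived"], men_total)

print(f"Overall survival rate: {overall_rate}%")
print(f"Women: {women_rate}%  Men: {men_rate}%")
```

The gap between the two gender-specific rates is large, which is worth keeping in mind when reading the per-group survival rates later on.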
Factors under analysis
Unfortunately, 267 passengers (20.39%) had to be excluded from the analysis due to missing age values. Furthermore, out of the 15 factors present in the original file, I selected the numerical ones with the strongest weights as calculated by VORTX. We usually classify factors, variables or data attributes into the following 3 categories:
- Protagonist – Factors with strong positive influence to generate a valuable pattern with clarity.
- Antagonist – Factors with noise or unclear patterns and negative influence that play against the protagonist.
- Supporting – Factors that do not play a significant role in changing the path of the analysis, but can enrich the results.
According to their influence power, the protagonists chosen for this analysis were:
- Age of the passenger = 87.85%
- How much each passenger paid to embark = 72.69%
- Number of parents on the ship = 71.69%
- Number of siblings or spouses on the ship = 72.42%
During the calculation, gender (whether the passenger was male or female) tended to play an antagonist role: it showed no pattern that helped to form the groups and dropped the dataset sharpness to 7%. Therefore, it was removed.
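The preparation described in this section (dropping passengers without an age, keeping only the four numerical protagonists, and leaving gender out) can be sketched as follows. The records are made up for illustration, and the field names (age, fare, parch, sibsp) follow the commonly distributed Titanic CSV, which is an assumption on my part rather than the schema VORTX uses.

```python
# Hypothetical records, for illustration only.
passengers = [
    {"name": "P1", "sex": "female", "age": 29.0, "fare": 211.34, "parch": 0, "sibsp": 0},
    {"name": "P2", "sex": "male", "age": None, "fare": 7.75, "parch": 0, "sibsp": 0},
    {"name": "P3", "sex": "male", "age": 2.0, "fare": 151.55, "parch": 2, "sibsp": 1},
    {"name": "P4", "sex": "female", "age": 38.0, "fare": 71.28, "parch": 0, "sibsp": 1},
]

# The four protagonists; "sex" is deliberately left out (antagonist).
FACTORS = ("age", "fare", "parch", "sibsp")

def prepare(records, factors=FACTORS):
    """Drop records with a missing age and keep only the chosen numeric factors."""
    kept = [r for r in records if r.get("age") is not None]
    return [[float(r[f]) for f in factors] for r in kept]

rows = prepare(passengers)
excluded = len(passengers) - len(rows)

print(f"{excluded} of {len(passengers)} records excluded for missing age")
print(rows[0])
```

Each remaining row is a plain list of numbers in the order given by FACTORS, which is the shape most clustering routines expect.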
VORTX Results and group characteristics
After processing, VORTX produced the following indicators. Most of them are not offered by other algorithms, so I give a brief explanation of each:
- Dataset Sharpness = 33.64%. It shows how clear or confident the machine is about the discovered grouping patterns. According to our dataset quality scale, sharpness above 20% is already useful for decision making.
- Automatic discovery of segments (groups) = 8. This function makes the whole process a lot easier for the data analyst. Unlike k-means and other algorithms, VORTX finds the right (ideal) number of groups by itself, dramatically reducing the segmentation errors that typically happen.
- Clustering Distinctness = How different the elements of each group are from the dataset as a whole, which is what makes them a group. The most distinctive group is number 5 with 51.48% (darker color) and the least distinctive is group 1 with 8.58%. This means that the elements of group 5 tend to be more homogeneous than those of the other groups.
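For readers who want a feel for what "finding the number of groups" involves, here is a minimal sketch using a different, openly documented technique: plain k-means run for several candidate values of k, with the mean silhouette score used to pick the best one. This is not how VORTX works internally (its algorithm is not described in this post), and the sample points are hypothetical (age, fare) pairs.

```python
import math

# Hypothetical (age, fare) points forming three loose blobs.
points = [
    (5, 30), (6, 32), (4, 29), (7, 31),
    (30, 8), (28, 7), (32, 9), (31, 8),
    (50, 80), (52, 85), (48, 78), (51, 82),
]

def kmeans(pts, k, iters=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [pts[0]]
    while len(centers) < k:
        centers.append(max(pts, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in pts]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(pts, labels) if lab == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centers[j])  # keep the old center if a cluster is empty
        if updated == centers:
            break
        centers = updated
    return labels

def silhouette(pts, labels):
    """Mean silhouette score; points in singleton clusters score 0."""
    scores = []
    for i, p in enumerate(pts):
        by_cluster = {}
        for j, q in enumerate(pts):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                   # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())     # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def choose_k(pts, candidates=range(2, 6)):
    """Return the candidate k with the highest mean silhouette, plus all scores."""
    all_scores = {k: silhouette(pts, kmeans(pts, k)) for k in candidates}
    return max(all_scores, key=all_scores.get), all_scores

best_k, scores = choose_k(points)
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

On toy data like this, where the blobs are far apart, the score should peak at the number of blobs. Real data is rarely this clean, which is one reason an indicator such as sharpness is useful alongside the group count.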
By analyzing the groups and checking them against who did and did not survive the trip, I arrived at the survival rate of each group plus its average ticket fare. If you had the characteristics of group 5 or 7, you would have had a better chance of surviving.
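Once every passenger carries a group label, the per-group survival rate and average fare are a simple aggregation. The sketch below assumes the labels are already available; the (group, survived, fare) triples are invented for illustration and are not the VORTX output for this dataset.

```python
from collections import defaultdict

# Hypothetical (group, survived, fare) triples; survived is 1 or 0.
records = [
    (5, 1, 30.0), (5, 1, 20.0), (5, 0, 25.0),
    (1, 0, 8.0), (1, 0, 7.0), (1, 1, 9.0), (1, 0, 8.0),
    (7, 1, 80.0), (7, 1, 100.0),
]

def summarise(rows):
    """Aggregate size, survival rate and average fare per group."""
    totals = defaultdict(lambda: {"n": 0, "survived": 0, "fare": 0.0})
    for group, survived, fare in rows:
        t = totals[group]
        t["n"] += 1
        t["survived"] += survived
        t["fare"] += fare
    return {
        g: {
            "size": t["n"],
            "survival_rate": t["survived"] / t["n"],
            "avg_fare": t["fare"] / t["n"],
        }
        for g, t in sorted(totals.items())
    }

summary = summarise(records)
for group, stats in summary.items():
    print(group, stats)
```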
Naming the groups
To operationalise a management strategy in any sector, you need to study the characteristics of each group and name them. So, looking at the key predominant characteristics of each group (or persona), let’s make a visual comparison of just 4 groups according to the factor “AGE”. The higher the curve goes, the greater the number of passengers with that characteristic. These factors can easily be studied interactively on the VORTX DataScope.
Another option is to look straight at the grouped data. In this case, I took a screenshot of the classified data of group number 5, which has the most distinct passengers on the whole ship, probably quite young people traveling with the whole family.
Conclusions and Recommendations
The most typical passenger is a young person with an average age of 21 years who paid £26.35 on average, while the least typical passenger is the only member of group 8: a 38-year-old who paid £7.775 and was traveling with both parents plus 4 siblings.
A case with little more than a thousand records is not the best showcase for finding these profiles. However, if you have millions of transactions, clients or patients, the tool could be key to optimizing your operation, reducing costs and targeting your public better. So:
- Who is the most typical client you have?
- What are the characteristics of each group?
- What is the total cost or revenue per group?
- What groups represent 80% of your cost or revenues?
- Which groups do you want your strategy to address, and which ones do you not?
- What are the Protagonist, Antagonist and Supporting factors that most affect your strategy?
- Does the persona created by VORTX match the persona you have today? Benchmark it!
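The question about which groups represent 80% of your cost or revenue is a classic Pareto-style calculation once you have totals per group. Here is a minimal sketch; the revenue figures and group labels are hypothetical, and the 0.8 threshold is just a default you can change.

```python
# Hypothetical revenue per group (any currency).
revenue_by_group = {1: 500.0, 2: 350.0, 3: 80.0, 4: 40.0, 5: 30.0}

def pareto_groups(revenue, threshold=0.8):
    """Return the largest groups, in order, until their cumulative share reaches the threshold."""
    total = sum(revenue.values())
    selected, running = [], 0.0
    for group, value in sorted(revenue.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(group)
        running += value
        if running / total >= threshold:
            break
    return selected

top_groups = pareto_groups(revenue_by_group)
print(f"Groups covering at least 80% of revenue: {top_groups}")
```

The same function works for cost instead of revenue: pass in cost totals per group and read the result as "where most of the spending goes".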
That is it for now. I hope this was interesting and useful for planning your big decisions ahead. In case you need a little help, let us know.
Thanks to the whole Aquarela team, who are unstoppable in making data analytics a lot easier and richer every single day.
VORTX Big Data
Aquarela developed VORTX Big Data to make predictive analytics a lot easier, more precise and more robust than current solutions on the market, with significant impact on business problems such as churn reduction, business scenario discovery, predictive maintenance, market segmentation and healthcare resource optimization.