Big Data Scenario Discovery, why is it super useful for decision making?

Hi everyone. In today's demonstration, we show how Big Data Scenario Discovery can support decision making in a profound way across various sectors. We use AQUARELA VORTX Big Data, a groundbreaking tool in the machine learning field. The dataset used for the experiment was presented in the previous post about Big Data country auto-segmentation (clustering). The differences here are that this one also includes the Gini index (found later on) and removes the electrification rate in rural areas. This analysis also seeks systemic influences towards a GOAL, in this case the Human Development Index (HDI), whereas the previous segmentation just grouped similar countries according to their general characteristics.

The key questions for the experiment:

  1. How many Human Development Index scenarios exist in total? And which countries belong to them?
  2. Amongst the 65 indexes, which have the most influence in defining a High or Low Human Development Index?
  3. What is the DNA (set of characteristics) of a High and Low Human Development scenario?

Alright, hang on for a minute! Before you see the results, take a look at all the variables analysed in the previous post. Then try to figure out by yourself, making the most of your intuition, what the answers to these 3 questions would be. This is a fun and very useful cognitive task for scenario validation. OK?

Results after pushing the Discoverer button:

HDI - Total

This is the overall distribution of the 188 countries. Most of them have an HDI between 0.65 and 0.75, and very few are above 0.90. In total, there are 15 different HDI scenarios, of which the first 3 account for more than 94% of the total. Those are the ones we are going to focus on.

Scenario 1

The most common scenario and the average HDI

Scenario 2

Countries with the lowest HDI

Scenario 3

Countries with the highest HDI

Where are they located?

[Figure: screenshot showing where the scenarios are located]

What factors influence HDI the most and the least?

Ranking

The list marks the top and bottom 10 factors. The factor "Intimate or non-intimate partner violence ever experienced 2001-2011" was automatically removed from the ranking as it does not correlate with HDI.
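How VORTX scores each factor is not described in this post. As a rough stand-in, the idea of ranking factors by how strongly they move together with HDI can be sketched with a plain Pearson correlation. The column names and values below are made up for illustration and are not the figures from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def rank_factors(table, target):
    """Rank every column except `target` by absolute correlation with it."""
    scores = {
        name: pearson(values, table[target])
        for name, values in table.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy values for five imaginary countries (assumption, not study data).
toy = {
    "hdi": [0.35, 0.50, 0.65, 0.80, 0.92],
    "under_five_mortality": [120.0, 80.0, 40.0, 12.0, 4.0],
    "gni_per_capita": [1.0, 3.0, 9.0, 25.0, 45.0],
    "unrelated_factor": [3.0, 1.0, 3.0, 1.0, 3.0],
}

ranking = rank_factors(toy, "hdi")
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

A factor whose score stays close to zero, like the unrelated one in this toy table, is the kind of factor that would drop out of such a ranking. As in the study itself, a high score here indicates correlation only, not causation.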

What is the DNA of each main scenario?

[Figure: VORTX screenshot]

All factors presented at once. Note that the scales on the X axis change dynamically when hovering the mouse over the VORTX data scope screen.

[Figure: two VORTX screenshots]

Drilling down into the DNA

Under-Five Mortality rates vs HDI

[Figure: three VORTX screenshots of under-five mortality rate vs HDI]

Filtering the visualisation by the most relevant factor and HDI (HDI is the focus of the analysis, so it has the darker colour). Here we see that the countries with the highest HDI have the lowest under-five mortality rates.

Gender Inequality Rate vs HDI

[Figure: three VORTX screenshots of gender inequality vs HDI]

Gross National Income GNI per capita vs HDI

[Figure: three VORTX screenshots of GNI per capita vs HDI]

Insights and Conclusions of the study

The possibilities for generating new knowledge from this Big Data strategy are endless, but we focused on just a few questions and a few screenshots to demonstrate its value. During this research, we found it interesting to see the machine autonomously confirming some previous intuitions while breaking some preconceptions. It is important to mention that we are not measuring causation, that is, whether one factor leads to another or vice versa; the results show systemic correlations only. Here are some of the findings that caught our attention:

  • Gender inequality playing a strong role, with an inverse correlation to the Human Development Index, while we are living through a transition from the industrial age to the information age, where knowledge is surpassing the physical differences between genders.
  • Research and development having a direct correlation to HDI.
  • The United States having its own scenario due to its unique systemic characteristics.
  • Gross National Income GNI per capita leading the ranking and the values around 40 thousand dollars.
  • Public expenditure ahead of Education related indexes.

Business applications

Applying the same questions we had at the beginning of the article, let's now see what they would look like for different business scenarios:

Sales

  • How many scenarios exist for your sales? Which customer segments belong to each scenario?
  • Amongst several business factors, which have the most influence in defining High or Low revenue?
  • What is the DNA (characteristics) of a High and a Low revenue scenario?

Industry

  • How many production/maintenance scenarios exist for your production line? Which processes belong to each scenario?
  • Amongst several production factors, which have the most influence in defining a High or Low outcome, or High or Low maintenance/costs?
  • What is the DNA (characteristics) of a High and a Low production/maintenance scenario?

Healthcare

  • How many patient scenarios exist for a specific disease or medical condition? Which patients belong to each scenario?
  • Amongst several patient characteristics, which have the most influence in producing High or Low levels of a specific disease or medical condition?
  • What is the DNA (characteristics) of a High and a Low medical condition scenario?

All in all, we hope this article helps you land smoothly on the newest territories of machine learning. If you need more information on how this solution applies to your business scenario, please let us know. If you found this analysis interesting and worth spreading, please share it. Many thanks on behalf of Aquarela's team!

What is Aquarela Advanced Analytics?

Aquarela Analytics is a pioneering Brazilian company and a reference in the application of Artificial Intelligence in industry and large companies. With the Vortx platform and DCIM methodology, it serves important global customers such as Embraer (aerospace & defence), Scania and Randon Group (automotive), Solar Br Coca-Cola (beverages), Hospital das Clínicas (healthcare), NTS-Brasil (oil & gas), Votorantim Energia (energy), among others.

Stay tuned by following Aquarela on LinkedIn!

How VORTX Big Data organises the world?

Hello everyone,

The objective of this post is to show you what happens when we give several numbers to a machine (VORTX Big Data) and it finds out by itself how the countries should be organized into different boxes. This technique is called clustering! The questions we will answer in this post are:

  • How are countries segmented based on the world’s indexes?
  • What are the characteristics of each group?
  • Which factors are the most influential for the separation?

Here we go!

Data First – What comes in?

I gathered 65 indexes for 188 countries. The sources are mainly:

  • UNDESA 2015,
  • UNESCO Institute for Statistics 2015,
  • United Nations Statistics Division 2015,
  • World Bank 2015,
  • IMF 2015.

Selected variables for the analysis were:

  1. Human Development Index HDI-2014
  2. Gini coefficient 2005-2013
  3. Adolescent birth rate 15-19 per 100k 2010-2015
  4. Birth registration under age 5 2005-2013
  5. Carbon dioxide emissions Average annual growth
  6. Carbon dioxide emissions per capita 2011 Tonnes
  7. Change forest percentage 1990 to 2012
  8. Change mobile usage 2009 2014
  9. Consumer price index 2013
  10. Domestic credit provided by financial sector 2013
  11. Domestic food price level 2009 2014 index
  12. Domestic food price level 2009-2014 volatility index
  13. Electrification rate or population
  14. Expected years of schooling – Years
  15. Exports and imports percentage GDP 2013
  16. Female Suicide Rate 100k people
  17. Foreign direct investment net inflows percentage GDP 2013
  18. Forest area percentage of total land area 2012
  19. Fossil fuels percentage of total 2012
  20. Freshwater withdrawals 2005
  21. Gender Inequality Index 2014
  22. General government final consumption expenditure – Annual growth 2005 2013
  23. General government final consumption expenditure – Percentage of GDP 2005-2013
  24. Gross domestic product GDP 2013
  25. Gross domestic product GDP per capita
  26. Gross fixed capital formation of GDP 2005-2013
  27. Gross national income GNI per capita – 2011  Dollars
  28. Homeless people due to natural disaster 2005 2014 per million people
  29. Homicide rate per 100k people 2008-2012
  30. Infant Mortality 2013 per thousands
  31. International inbound tourists thousands 2013
  32. International student mobility of total tertiary enrolment 2013
  33. Internet users percentage of population 2014
  34. Intimate or non-intimate partner violence ever experienced 2001-2011
  35. Life expectancy at birth- years
  36. Male Suicide Rate 100k people
  37. Maternal mortality ratio deaths per 100 live births 2013
  38. Mean years of schooling – Years
  39. Mobile phone subscriptions per 100 people 2014
  40. Natural resource depletion
  41. Net migration rate per 1k people 2010-2015
  42. Physicians per 10k people
  43. Population affected by natural disasters average annual per million people 2005-2014
  44. Population living on degraded land Percentage 2010
  45. Population with at least some secondary education percent 2005-2013
  46. Pre-primary 2008-2014
  47. Primary-2008-2014
  48. Primary school dropout rate 2008-2014
  49. Prison population per 100k people
  50. Private capital flows percentage GDP 2013
  51. Public expenditure on education Percentage GDP
  52. Public health expenditure percentage of GDP 2013
  53. Pupil-teacher ratio primary school pupils per teacher 2008-2014
  54. Refugees by country of origin
  55. Remittances inflows GDP 2013
  56. Renewable sources percentage of total 2012
  57. Research and development expenditure 2005-2012
  58. Secondary 2008-2014
  59. Share of seats in parliament percentage held by women 2014
  60. Stock of immigrants percentage of population 2013
  61. Taxes on income profit and capital gain 2005-2013
  62. Tertiary -2008-2014
  63. Total tax revenue of GDP 2005-2013
  64. Tuberculosis rate per thousands 2012
  65. Under-five Mortality 2013 per thousands

What comes out?

Let's start by looking at the map to see where these groups are, then move to the VORTX visualization to better understand the DNA (the composition of factors) of each group.

Mundi

Click on the picture to play around with the map inside Google maps.

OK, I see the clusters, but now I want to know which combination of characteristics unites or separates them. The picture below shows the VORTX visualization considering all groups and all factors.

Main groups

On the left side are the groups and their proportions. Segmentation sharpness measures how different the groups are based on all factors. On the right side is the total composition of variables, which we can call the world's DNA.

In the next figures, you will see how different it becomes when we select individual groups.

Cluster 1

The most typical situation of a country, representing 51.60% of the globe. We call them average countries.

Cluster 2

The second most common type representing 26.46% of the globe.

Cluster 3

This is the cluster with the so-called first world countries, whose results are above average, representing 14.89% of the globe. The United States does not belong to this group, but Canada, Australia, New Zealand and Israel do.

Cluster 4 - USA

The US is numerically so different from the rest of the world that VORTX placed it alone in its own group, which had the highest distinctiveness (38.93%).

United Arab Emirates

Some other countries had no similar countries to share a group with. This is the case of the United Arab Emirates.

Before we finish, here are the 5 most and the 5 least influential factors that VORTX identified as key to creating the groups.

Top 5

  1. Maternal mortality ratio deaths per 100 live births 2013 – 91% influence
  2. Under-five Mortality 2013 per thousands – 90%
  3. Human Development Index HDI-2014  – 90%
  4. Infant Mortality 2013 per thousands – 90%
  5. Life expectancy at birth- years – 90%

Bottom 5

  1. Renewable sources percentage of total 2012 – 70% influence
  2. Total tax revenue of GDP 2005-2013 – 72%
  3. Public health expenditure percentage of GDP 2013 – 73%
  4. General government final consumption expenditure – Percentage of GDP 2005-2013 – 73%
  5. General government final consumption expenditure – Annual growth 2005 2013 – 75%

Conclusions

According to VORTX, if you plan to live in another country or sell your product abroad, it would be wise to check which group that country belongs to. If it belongs to the same group as the country you live in, then you know what to expect.

Could other factors be added to or removed from the analysis? Yes, absolutely. However, it is not always easy to get the information you need at the time you need it. Big Data analyses usually have several constraints and typically rely on the type of questions posed to the data and to the algorithm, which in turn rely on the creativity of the Data Scientist.

The clustering approach is becoming more and more common in the industry due to its strategic role in organizing and simplifying the decision-making chaos. After all, how could a manager look at 12,220 cells (188 countries × 65 indexes) to define a regional strategy?

Any questions or doubts? Did anything catch your attention? Please leave a comment!

For those who wish to see the platform operating in practice, here is a video using data from Switzerland. Enjoy!

 

How Titanic passengers are segmented by VORTX Big Data?

To demonstrate how VORTX works, I selected a well-known dataset with information about the passengers who embarked on the Titanic. Despite the tragic event, this dataset is fairly rich in detail and has been widely used in Machine Learning communities, since it allows the application of several Big Data techniques.

In this case, I am going to apply VORTX, a Big Data tool focused on automatic segmentation plus other important decision-making indicators. This technique is called clustering. More information about it can be found in this post (How can big data clustering strategy help business). In the conclusion section, I give some ideas on how this innovative approach can help businesses.

Titanic Dataset summary

According to Encyclopedia Titanica “On 10 April 1912, the new liner sailed from Southampton, England with 2,208 passengers and crew, but four days later she collided with an iceberg and sank: 1496 people died and 712 survived”. The data we had access to for this analysis contained the following figures:

  • 1309 people on board of which 500 survived (38%) and 809 (62%) died.
  • An average age of 29.88 years (estimated).
  • 466 women of which 127 died and 339 survived.
  • 843 men, of which 682 died and 161 survived.
  • The ticket cost on average £53.65 per woman and £76.60 per man.
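As a quick consistency check, the counts listed above can be added up and turned into survival rates by gender. This is only a small arithmetic sketch using the figures quoted in the list; the variable and function names are illustrative.

```python
# Counts quoted in the dataset summary above.
women = {"died": 127, "survived": 339}
men = {"died": 682, "survived": 161}


def survival_rate(group):
    """Share of a group that survived."""
    total = group["died"] + group["survived"]
    return group["survived"] / total


total_on_board = sum(women.values()) + sum(men.values())
total_survived = women["survived"] + men["survived"]

print(f"People on board: {total_on_board}")
print(f"Overall survival rate: {total_survived / total_on_board:.1%}")
print(f"Women survival rate: {survival_rate(women):.1%}")
print(f"Men survival rate: {survival_rate(men):.1%}")
```

The totals match the 1309 people and 500 survivors reported above, and the split works out to roughly 72.7% of women and 19.1% of men surviving.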

For more details on the complete dataset – Google for Titanic Dataset.

Factors under analysis

Unfortunately, 267 passengers (20.39%) had to be excluded from the analysis due to missing age values. Furthermore, out of the 15 factors present in the original file, I selected the numerical ones with the stronger weights calculated by VORTX. Usually, we classify factors, variables or data attributes into the following 3 categories:

  • Protagonist – Factors with strong positive influence to generate a valuable pattern with clarity.
  • Antagonist – Factors with noise or unclear patterns and negative influence that play against the protagonist.
  • Supporting – Factors that do not play a significant role in changing the path of the analysis, but can enrich the results.

According to the influence power, the protagonists chosen for this analysis were:

  • Age of the passenger = 87.85%
  • How much each passenger paid to embark = 72.69%
  • Number of parents on the ship = 71.69%
  • Number of siblings or spouses on the ship = 72.42%

During the calculation, gender (whether the passenger was male or female) tended to play an antagonist role: it showed no pattern useful for forming the groups and dropped the dataset sharpness to 7%. Therefore, it was removed.

VORTX Results and group characteristics

After processing, VORTX produced the following indicators. Most of them are not offered by other algorithms, so I give a brief explanation of each:

  • Dataset Sharpness = 33.64%. It shows how clear or confident the machine is about the discovered grouping patterns. According to our dataset quality scale, sharpness above 20% is already useful for decision making.
  • Automatic discovery of segments (groups) = 8. This is a function that makes the whole process a lot easier for the data analyst. Unlike k-means and other algorithms, VORTX finds the right (ideal) number of groups by itself, dramatically reducing the segmentation errors that typically happen.
  • Clustering Distinctness = how different the elements of each group are from the overall set, which is what makes them a group. The most distinctive group is number 5 with 51.48% (darker color) and the least distinctive is group 1 with 8.58%. This means that elements from group 5 tend to be more homogeneous than those of the other groups.
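For context on the k-means comparison above: with k-means the analyst has to supply the number of groups. How VORTX discovers it is not described in this post. One common workaround, shown here purely as an illustrative stand-in and not as VORTX's method, is to run k-means for several candidate values and keep the one with the best mean silhouette score. The toy points, the deterministic seeding and all function names are assumptions made for this sketch.

```python
from math import dist  # Python 3.8+


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=50):
    """Plain k-means; returns one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; values near 1 mean compact, separated groups."""
    scores = []
    for i, p in enumerate(points):
        same = [
            dist(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not same:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        a = sum(same) / len(same)
        b = min(
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def pick_k(points, candidates):
    """Return the candidate number of groups with the highest silhouette."""
    return max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))


# Hypothetical 2-D toy data with three visually obvious groups.
points = [
    (1, 1), (1, 2), (2, 1), (2, 2),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (1, 10), (2, 11), (1, 11), (2, 10),
]

best_k = pick_k(points, range(2, 6))
print("suggested number of groups:", best_k)
```

On real data the silhouette curve is rarely this clear-cut, which is one reason an automatic choice of the number of groups saves the analyst a good deal of trial and error.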

VORTX VIEW

VORTX screenshot

By analyzing the groups and checking them against who survived the trip, I arrived at the survival rate of each group plus the average ticket fare. If you had the characteristics of group 5 or 7, you would have had better chances of surviving.
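Once every passenger carries a group label, checking the groups against survival is a simple aggregation. The sketch below shows the idea with a handful of hypothetical records; the group numbers, outcomes and fares are invented for illustration and are not the results of this analysis.

```python
from collections import defaultdict

# Hypothetical records: (group id, survived flag, fare in pounds).
passengers = [
    (1, 0, 8.0), (1, 0, 7.5), (1, 1, 9.0), (1, 0, 7.5),
    (5, 1, 80.0), (5, 1, 120.0), (5, 0, 100.0),
    (7, 1, 30.0), (7, 1, 26.0),
]


def summarise_by_group(records):
    """Return {group: (survival_rate, average_fare)} for (group, survived, fare) rows."""
    totals = defaultdict(lambda: {"count": 0, "survived": 0, "fare": 0.0})
    for group, survived, fare in records:
        stats = totals[group]
        stats["count"] += 1
        stats["survived"] += survived
        stats["fare"] += fare
    return {
        group: (stats["survived"] / stats["count"], stats["fare"] / stats["count"])
        for group, stats in totals.items()
    }


summary = summarise_by_group(passengers)
for group, (rate, fare) in sorted(summary.items()):
    print(f"group {group}: survival {rate:.0%}, average fare £{fare:.2f}")
```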