Hello everyone,

The objective of this post is to show you what happens when we give several numbers to a machine (VORTX Big Data) and it finds out by itself how the countries should be organized into different boxes. This technique is called clustering! The questions we will answer in this post are:

  • How are countries segmented based on the world’s indexes?
  • What are the characteristics of each group?
  • Which factors are the most influential for the separation?

Here we go!
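
Before the real data, a toy sketch may help make the idea of clustering concrete. It uses plain k-means, purely as an illustration: this post does not describe the algorithm inside VORTX, and unlike VORTX, k-means has to be told how many groups to look for. The four "countries" and their two values are invented for the example, and the function name kmeans and the starting centroids are assumptions.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move the centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of the points assigned to it.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Invented toy values (NOT real country data): (life expectancy, infant mortality)
toy = {"A": (82.0, 3.0), "B": (81.0, 4.0), "C": (60.0, 55.0), "D": (58.0, 60.0)}
labels, _ = kmeans(list(toy.values()), centroids=[(80.0, 5.0), (60.0, 50.0)])
groups = dict(zip(toy, labels))  # maps each toy country to its box number
```

In this toy case A and B land in one box and C and D in the other, simply because their numbers are close to each other. The real analysis does the same kind of grouping with 65 numbers per country instead of two.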

Data First – What comes in?

I gathered 65 indexes for 188 countries of the world. The main sources are:

  • UNDESA 2015,
  • UNESCO Institute for Statistics 2015,
  • United Nations Statistics Division 2015,
  • World Bank 2015,
  • IMF 2015.

Selected variables for the analysis were:

  1. Human Development Index HDI-2014
  2. Gini coefficient 2005-2013
  3. Adolescent birth rate 15-19 per 100k 2010-2015
  4. Birth registration under age 5 2005-2013
  5. Carbon dioxide emissions Average annual growth
  6. Carbon dioxide emissions per capita 2011 Tonnes
  7. Change forest percentage 1990 to 2012
  8. Change mobile usage 2009 2014
  9. Consumer price index 2013
  10. Domestic credit provided by financial sector 2013
  11. Domestic food price level 2009 2014 index
  12. Domestic food price level 2009-2014 volatility index
  13. Electrification rate of population
  14. Expected years of schooling – Years
  15. Exports and imports percentage GDP 2013
  16. Female Suicide Rate 100k people
  17. Foreign direct investment net inflows percentage GDP 2013
  18. Forest area percentage of total land area 2012
  19. Fossil fuels percentage of total 2012
  20. Freshwater withdrawals 2005
  21. Gender Inequality Index 2014
  22. General government final consumption expenditure – Annual growth 2005 2013
  23. General government final consumption expenditure – Percentage of GDP 2005-2013
  24. Gross domestic product GDP 2013
  25. Gross domestic product GDP per capita
  26. Gross fixed capital formation of GDP 2005-2013
  27. Gross national income GNI per capita – 2011 Dollars
  28. Homeless people due to natural disaster 2005 2014 per million people
  29. Homicide rate per 100k people 2008-2012
  30. Infant Mortality 2013 per thousand
  31. International inbound tourists thousands 2013
  32. International student mobility of total tertiary enrolment 2013
  33. Internet users percentage of population 2014
  34. Intimate or non-intimate partner violence ever experienced 2001-2011
  35. Life expectancy at birth – Years
  36. Male Suicide Rate 100k people
  37. Maternal mortality ratio deaths per 100k live births 2013
  38. Mean years of schooling – Years
  39. Mobile phone subscriptions per 100 people 2014
  40. Natural resource depletion
  41. Net migration rate per 1k people 2010-2015
  42. Physicians per 10k people
  43. Population affected by natural disasters average annual per million people 2005-2014
  44. Population living on degraded land Percentage 2010
  45. Population with at least some secondary education percent 2005-2013
  46. Pre-primary 2008-2014
  47. Primary 2008-2014
  48. Primary school dropout rate 2008-2014
  49. Prison population per 100k people
  50. Private capital flows percentage GDP 2013
  51. Public expenditure on education Percentage GDP
  52. Public health expenditure percentage of GDP 2013
  53. Pupil-teacher ratio primary school pupils per teacher 2008-2014
  54. Refugees by country of origin
  55. Remittances inflows GDP 2013
  56. Renewable sources percentage of total 2012
  57. Research and development expenditure 2005-2012
  58. Secondary 2008-2014
  59. Share of seats in parliament percentage held by women 2014
  60. Stock of immigrants percentage of population 2013
  61. Taxes on income profit and capital gain 2005-2013
  62. Tertiary 2008-2014
  63. Total tax revenue of GDP 2005-2013
  64. Tuberculosis rate per thousand 2012
  65. Under-five Mortality 2013 per thousand
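
The indexes above use very different units (years, percentages, dollars, rates per 100k people), so a distance-based grouping would be dominated by the variables with the largest numbers unless they are first put on a common scale. Whether and how VORTX rescales its inputs is not stated in this post. The sketch below shows one common option, z-score standardization, with invented values and a hypothetical helper named standardize.

```python
from statistics import mean, pstdev

def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        # A constant column carries no information for grouping.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Invented values (NOT real data) on two very different scales.
gni_per_capita = [1000.0, 20000.0, 45000.0]   # dollars
hdi = [0.40, 0.75, 0.90]                      # index between 0 and 1

# After standardization both columns have mean 0 and standard deviation 1.
gni_z = standardize(gni_per_capita)
hdi_z = standardize(hdi)
```

After this step a difference of one unit means "one standard deviation" for every variable, so dollars no longer outweigh an index that only ranges from 0 to 1.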

What comes out?

Let’s start with the map to see where these groups are, then move on to the VORTX visualization to better understand the DNA of each group (its composition of factors).

Mundi

Click on the picture to explore the map in Google Maps.

OK, I see the clusters, but now I want to know which combination of characteristics unites or separates them. The picture below shows the VORTX visualization considering all groups and all factors.

Main groups

On the left side are the groups and their proportions. Segmentation sharpness measures how different the groups are from one another, based on all factors. On the right side is the total composition of variables, which we can call the world’s DNA.

In the next figures, you will see how this composition changes when we select individual groups.

Cluster 1

The most typical situation for a country, representing 51.60% of the globe. We call these the average countries.

Cluster 2

The second most common type, representing 26.46% of the globe.

Cluster 3

This is the cluster of the so-called first world countries, whose results are above average, representing 14.89% of the globe. The United States does not belong to this group, but Canada, Australia, New Zealand and Israel do.

Cluster 4 - USA

The US is numerically so different from the rest of the world that VORTX placed it alone in a group of its own, which had the highest distinctiveness (38.93%).

United Arab Emirates

Some other countries also had no similar countries to share a group with. This is the case of the United Arab Emirates.

Before we finish, here are the 5 most and the 5 least influential factors that VORTX identified as key to creating the groups.

Top 5

  1. Maternal mortality ratio deaths per 100k live births 2013 – 91% influence
  2. Under-five Mortality 2013 per thousand – 90%
  3. Human Development Index HDI-2014 – 90%
  4. Infant Mortality 2013 per thousand – 90%
  5. Life expectancy at birth – Years – 90%

Bottom 5

  1. Renewable sources percentage of total 2012 – 70% influence
  2. Total tax revenue of GDP 2005-2013 – 72%
  3. Public health expenditure percentage of GDP 2013 – 73%
  4. General government final consumption expenditure – Percentage of GDP 2005-2013 – 73%
  5. General government final consumption expenditure – Annual growth 2005 2013 – 75%
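
This post does not define how VORTX computes its influence percentages, so the following is only a sketch of one common way to score how strongly a single variable separates a set of groups: the share of the variable's total variance that lies between the groups (often called eta squared). The function name eta_squared and all values are invented for illustration.

```python
from statistics import mean

def eta_squared(values, labels):
    """Share of a variable's variance explained by the grouping (between 0 and 1)."""
    grand = mean(values)
    total = sum((v - grand) ** 2 for v in values)
    if total == 0:
        return 0.0  # a constant variable cannot separate anything
    between = 0.0
    for group in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == group]
        # Each group contributes its size times its squared distance from the grand mean.
        between += len(members) * (mean(members) - grand) ** 2
    return between / total

# Invented example: four countries already assigned to two groups.
labels = [0, 0, 1, 1]
separating = [1.0, 2.0, 9.0, 10.0]   # values differ sharply between the groups
mixed = [1.0, 9.0, 2.0, 10.0]        # values overlap across the groups

score_separating = eta_squared(separating, labels)
score_mixed = eta_squared(mixed, labels)
```

A variable whose values differ sharply between groups scores close to 1, while a variable whose values overlap across groups scores close to 0. Under this measure, a ranking like the one above would mean that the mortality and life expectancy indexes differ between groups more cleanly than the tax and government spending indexes do.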

Conclusions

According to VORTX, if you plan to live in another country or sell your product abroad, it would be wise to check which group that country belongs to. If it belongs to the same group as the country you live in, then you know what to expect.

Could other factors be added to or removed from the analysis? Yes, absolutely. However, it is not always easy to get the information you need at the time you need it. Big Data analyses usually have several constraints and typically rely on the type of questions posed to the data and to the algorithm, which in turn rely on the creativity of the Data Scientist.

The clustering approach is becoming more and more common in the industry due to its strategic role in organizing and simplifying the decision-making chaos. After all, how could a manager look at 12,220 cells (188 countries × 65 indexes) to define a regional strategy?

Any questions or doubts? Did anything catch your attention? Please leave a comment!

For those who wish to see the platform operating in practice, here is a video using data from Switzerland. Enjoy!

VORTX Big Data

Aquarela developed VORTX Big Data to make predictive analytics a lot easier, more precise and more robust than current solutions on the market, with significant impact on business problems such as churn reduction, business scenario discovery, predictive maintenance, market segmentation and healthcare resource optimization.

Authors
Marcos Santos
Founder of Aquarela, CEO and architect of the VORTX platform. Master in Engineering and Knowledge Management, enthusiast of new technologies, with expertise in the Scala functional language and in Machine Learning and AI algorithms.

Joni Hoppen
Founder of Aquarela, professor and lecturer in Data Science, master in Information Systems, focused on rapid prototyping of Big Data Analytics and on data culture.

Citation information: Did you like this material? If you want to enrich your research or report (whether a blog post or an academic article), cite our content as: Aquarela 2018 - Inteligência Artificial para negócios (www.aquare.la).