What are outliers and how to treat them in Data Analytics?

What are outliers? They are data records that differ dramatically from all others; they distinguish themselves in one or more characteristics. In other words, an outlier is a value that escapes normality and can (and probably will) cause anomalies in the results obtained through algorithms and analytical systems. Therefore, outliers always need some degree of attention.

Understanding outliers is critical when analyzing data, for at least two reasons:

  1. The outliers may negatively bias the entire result of an analysis;
  2. the behavior of outliers may be precisely what is being sought.

Depending on the context, many words can refer to outliers: aberration, oddity, deviation, anomaly, eccentricity, nonconformity, exception, irregularity, and so on. Below are some common situations in which outliers arise in data analysis, along with suggested approaches for dealing with them in each case.

How to identify which records are outliers?

Find outliers using tables

The simplest way to find outliers in your data is to look directly at the data table or worksheet – the dataset, as data scientists call it. The following table clearly exemplifies a typing error, that is, a data-input error. The age field for the individual Antony Smith certainly does not represent an age of 470 years. Looking at the table it is possible to identify the outlier, but it is difficult to say what the correct age would be. There are several possibilities for the right age, such as 47, 70 or even 40 years.

Antony Smith age outlier

In a small sample, finding outliers with tables can be easy. But when the number of observations goes into the thousands or millions, it becomes impossible. The task becomes even harder when many variables (the worksheet columns) are involved. For that, there are other methods.
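As a minimal illustration of this idea, a rule-based filter over a table can surface such typing errors automatically. The column names, the rows and the 0–120 sanity range below are assumptions for illustration:

```python
# Hypothetical sketch: flagging implausible values in a small dataset.
import pandas as pd

patients = pd.DataFrame({
    "name": ["Antony Smith", "Maria Silva", "John Doe"],
    "age":  [470, 34, 58],
})

# A simple sanity rule: human ages outside 0-120 are almost certainly input errors.
suspects = patients[(patients["age"] < 0) | (patients["age"] > 120)]
print(suspects)  # flags the Antony Smith row
```

With millions of rows, the same one-line filter works where visual inspection cannot.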

Find outliers using graphs

One of the best ways to identify outliers is by using charts. When plotting a chart, the analyst can clearly see that something different exists. Here are some examples that illustrate the spotting of outliers with graphics.
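As a minimal sketch of this idea, with invented data, simply plotting the records makes the discrepant point jump out:

```python
# Hypothetical sketch: spotting an outlier visually with a scatter plot.
# The age values are invented; 470 is the deliberate typo.
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs headless
import matplotlib.pyplot as plt

ages = [23, 31, 28, 45, 52, 470, 38, 41]
plt.scatter(range(len(ages)), ages)
plt.xlabel("record index")
plt.ylabel("age")
plt.savefig("ages.png")  # the 470 point stands far above all the others
```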

Case: outliers in the Brazilian health system

In a study already published on Aquarela’s website, we analyzed the factors that lead people to miss medical appointments scheduled in the public health system of the city of Vitória, in the state of Espírito Santo, no-shows that caused an approximate loss of 8 million US dollars a year.

In the dataset, several patterns were found, for example: children practically do not miss their appointments, and women attend consultations much more than men. However, a curious case was that of an outlier who, at age 79, scheduled a consultation 365 days in advance and actually showed up to her appointment.

This is a case of an outlier that deserves to be studied, because this lady’s behavior can bring relevant information about measures that could be adopted to increase the attendance rate. See the case in the chart below.

sample of 8000 appointments

Case: outliers in the Brazilian financial market

On May 17, 2017, Petrobras shares fell 15.8% and the stock market index (IBOVESPA) fell 8.8% in a single day. Most of the shares on the Brazilian stock exchange saw their price plummet that day. This strong negative variation had as its main motivation the Joesley Batista scandal, one of the most shocking political events of the first half of 2017.

This case represents an outlier for the analyst who, for example, wants to know the average daily return of Petrobras shares over the last 180 days. Certainly, the Joesley events pulled the average down sharply. Analyzing the chart below, even in the face of several observations, it is easy to identify the point that disagrees with the others.

Petrobras 2017

The data point in the example above may be called an outlier, but taken literally it cannot necessarily be considered one. The “curve” in the graph above, although counter-intuitive, is represented by the straight line that cuts through the points. The graph also shows that, although the point is different from the others, it does not lie exactly outside the curve.

A predictive model could easily infer, with high precision, that a 9% drop in the stock market index would represent a 15% drop in Petrobras’ share price. In another case, still with data from the Brazilian stock market, the stock of the company Magazine Luiza appreciated 30.8% on a day when the stock market index rose by only 0.7%. This data point, besides being atypical and distant from the others, is also truly outside the curve. See the chart:

This is an outlier case that can harm not only descriptive statistics calculations, such as the mean and median, for example, but it also affects the calibration of predictive models.
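The distinction between “extreme but on the curve” and a true outlier can be sketched numerically: fit a line to index returns versus stock returns and inspect the residuals. Apart from the -8.8%/-15.8% and 0.7%/30.8% pairs quoted above, the figures below are invented for illustration:

```python
# Hypothetical sketch: a point can be extreme yet lie near the fitted line
# (the Petrobras day), while another sits far from it (the Magazine Luiza day).
import numpy as np

# daily returns (%): stock market index vs. individual stock
index_ret = np.array([0.3, -0.5, 1.1, -8.8, 0.2, 0.7])
stock_ret = np.array([0.5, -0.9, 2.0, -15.8, 0.4, 30.8])

slope, intercept = np.polyfit(index_ret, stock_ret, 1)
residuals = stock_ret - (slope * index_ret + intercept)

# the largest residual flags the point farthest from the fitted line:
# here it is the 0.7% / 30.8% day, not the extreme -8.8% / -15.8% day
print(int(np.argmax(np.abs(residuals))))
```

Note that the extreme crash day has a small residual: it is unusual but consistent with the line, which is exactly the “not outside the curve” argument made above.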

Find outliers using statistical methods

A more complex but quite precise way of finding outliers in a data analysis is to find the statistical distribution that most closely approximates the distribution of the data, and then use statistical methods to detect discrepant points. The following example uses the histogram of the well-known driver metric “kilometers per liter”.

The dataset used for this example is a public dataset widely used in statistical tests by data scientists. It comes from the 1974 Motor Trend US magazine and comprises several aspects of the performance of 32 car models. More details at this link.

The histogram is one of the main and simplest graphing tools for the data analyst to use in understanding the behavior of the data.

In the histogram below, the blue line represents what the normal (Gaussian) distribution would be based on the mean, standard deviation and sample size, and is contrasted with the histogram in bars.

The red vertical lines represent units of standard deviation. It can be seen that cars with outlier performance for the era averaged more than 14 kilometers per liter, which corresponds to more than 2 standard deviations from the mean.

Under a normal distribution, data within two standard deviations of the mean corresponds to roughly 95% of all data; in this analysis, the outliers represent the remaining 5%.
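A minimal sketch of this 2-standard-deviation rule, with invented km/l values standing in for the real dataset:

```python
# Hypothetical sketch of the 2-sigma rule: flag values more than two
# standard deviations from the mean. The km/l figures are invented.
import statistics

kml = [8.6, 9.0, 9.2, 9.4, 9.5, 9.7, 9.8, 10.0,
       10.1, 10.3, 10.5, 10.8, 11.0, 11.3, 15.2]

mean = statistics.mean(kml)
sd = statistics.stdev(kml)

# anything outside mean +/- 2*sd is flagged as a candidate outlier
outliers = [x for x in kml if abs(x - mean) > 2 * sd]
print(outliers)  # only the 15.2 km/l car is flagged
```

The same rule only makes sense when the data is roughly normal; for skewed distributions, other criteria (such as the interquartile range) are safer.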

Outliers in clustering

In this video in English (with subtitles), we present the identification of outliers in a visual way, using a visual clustering process with national flags.

Conclusions: What to do with outliers?

We have seen that it is imperative to pay attention to outliers because they can bias data analysis. Beyond identifying outliers, we suggest some ways to treat them better:

  • Exclude the discrepant observations from the data sample: when the discrepant data is the result of a data-input error, it should be removed from the sample;
  • perform a separate analysis with only the outliers: this approach is useful when you want to investigate extreme cases, such as students who only get good grades, companies that profit even in times of crisis, fraud cases, among others;
  • use clustering methods to correct outliers: in cases of data-input errors, instead of deleting and losing an entire row of records because of a single outlier observation, one solution is to use clustering algorithms that find the observations closest to the outlier and infer what the best approximate value would be.
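The last suggestion can be sketched roughly as follows: replace the bad value with the average of its nearest neighbours on the remaining, trustworthy variables. The records, the column choice and k = 3 are assumptions for illustration:

```python
# Rough sketch of nearest-neighbour imputation for an outlier value.
import numpy as np

# columns: [height_cm, weight_kg, age]; the 470-year age is the outlier
records = np.array([
    [172.0, 70.0,  47.0],
    [175.0, 72.0,  45.0],
    [168.0, 69.0,  50.0],
    [180.0, 85.0,  31.0],
    [171.0, 71.0, 470.0],   # input error in the age column
])

target = 4   # row holding the bad value
k = 3        # number of neighbours to average
other = np.delete(records, target, axis=0)

# distance computed on the trustworthy columns only (height, weight)
dist = np.linalg.norm(other[:, :2] - records[target, :2], axis=1)
nearest = other[np.argsort(dist)[:k]]
imputed_age = nearest[:, 2].mean()
print(round(imputed_age, 1))  # a plausible replacement for 470
```

This way the whole record is preserved, and only the implausible field is corrected.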

Finally, the main conclusion about the outliers can be summarized as follows:

“an outlier may be what most disturbs your analysis, but it may also be exactly what you are looking for.”

What is Aquarela Advanced Analytics?

Aquarela Analytics is a pioneering Brazilian company and a reference in the application of Artificial Intelligence in industry and large companies. With the Vortx platform and DCIM methodology, it serves important global customers such as Embraer (aerospace & defence), Scania and Randon Group (automotive), Solar Br Coca-Cola (beverages), Hospital das Clínicas (healthcare), NTS-Brasil (oil & gas), Votorantim Energia (energy), among others.

Stay tuned by following Aquarela’s LinkedIn!

Authors

Human Resources Optimised with Advanced Analytics

Today we are going to present some insights related to employee work satisfaction, using Advanced Analytics tools and techniques. As a source for this study, we use the data made available at this link by the data scientist Ludovic Benistant, who carried out important anonymization. Some pictures contain Brazilian Portuguese words, sorry about that! Let’s go!

Research Questions

Following the DCIM (Data Culture Introduction Methodology) methodology to guide this research, we came up with the following questions:

  • What factors have the greatest influence on employee satisfaction?
  • What are the main satisfaction scenarios that exist?
  • What are the main patterns associated with key satisfaction scenarios?
  • What factors influence professionals to leave?

Data Characteristics

In total, 14,999 employees were evaluated, considering the following variables already sanitized by our scripts:

  • Employee satisfaction level (0 to 10) – probably filled in by the employee;
  • Last evaluation (0 to 10) – probably filled in by a manager;
  • Number of projects (2 to 7) – number of projects in which the employee acted;
  • Average monthly hours (96 to 310);
  • Time spent at the company (2 to 10) – how long the person has worked at the company;
  • Whether they have had an accident at work (Yes = 1 / No = 0);
  • Whether they have had a promotion in the last 5 years (Yes = 1 / No = 0);
  • Salary range (Low = 1, Medium = 2, High = 3) – note: actual values were not made available;
  • Left the company (Yes = 1 / No = 0).
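As a hypothetical first pass over variables like these, one might compare the averages of those who stayed against those who left. The few rows below are invented, not the actual 14,999-employee dataset:

```python
# Hypothetical sketch: splitting summary statistics by the "left" flag.
import pandas as pd

df = pd.DataFrame({
    "satisfaction":  [9.0, 2.0, 7.5, 3.1, 8.2, 1.8],
    "monthly_hours": [160, 290, 175, 270, 150, 305],
    "left":          [0, 1, 0, 1, 0, 1],
})

# average satisfaction and workload, split by whether the employee left
summary = df.groupby("left")[["satisfaction", "monthly_hours"]].mean()
print(summary)
```

In the toy data, the leavers show lower satisfaction and much heavier monthly hours, the kind of contrast the scenarios below explore on the real dataset.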

Number of people per department

per-departament

Frequency Analysis / Distribution of Satisfaction

overal-satisfaction-level

The highest concentration of satisfaction is within the range of 7 to 9, and there are few people with satisfaction scores between 1.5 and 3.0.

Results

Ranking of Influence Factors in Work Satisfaction

By processing this dataset with the VORTX Big Data algorithm, we obtained the following influence ranking (relative weights in parentheses):

  1. Average monthly hours (50)
  2. Time spent at the company (21)
  3. Number of projects (20)
  4. Salary Range (13)
  5. Left the company (10)
  6. Whether they have had a promotion in the last 5 years (9)
  7. Whether they have had an accident at work (9)

The factor “Last evaluation” had no relevant influence and was automatically discarded by VORTX.

Satisfaction Scenarios

In the table below we have the result of the processing, with employees separated into groups automatically by the platform. In all, 120 groups were found; here we focus only on the 20 most relevant and leave the others aside as isolated cases outside the focus of this analysis.

english-table

Model Visual Validation

Typically, managers, as far as we have experienced, are not sure about a machine’s ability to automate the discovery of insights. Therefore, as proof of the model, we chose to show the raw data visually to demonstrate the insights mentioned above.

grupo-9-o-mais-insatisfeitos

The pattern of hours worked by the 588 people in scenario 9 (very dissatisfied). X axis = monthly working hours.

 

grupo-1

The pattern of hours worked in the largest scenario (1), which has 4,085 employees, good job satisfaction and a low level of job evasion. X axis = monthly working hours.

In the view below, each circle represents an employee in four dimensions:

  • The level of satisfaction on the Y axis.
  • Average hours per month on the X axis.
  • Orange colors for people who left the company and blue for those who remain.
  • Circle size represents the number of years in the company.

general-pattern

Alright, we have just seen the overall pattern for the whole organization, so what would happen if we viewed it by department?

accounting-and-it

managment-to-product

rd-and-support

technical

Conclusions and Recommendations

This study sheds some light on the improvement of human resource management, which is at the heart of today’s businesses. Applying data analytics algorithms in this area makes it possible to automate and accelerate the discovery of patterns in complex environments with, say, 50 variables or more; here we had just a few. Meanwhile, the search for patterns in traditional BI continues to be purely artisanal work, with a well-known limitation of 4 dimensions per attempt (read more on this at Understanding the differences between BI, Big Data and Data Mining). The automation of discovery is an extremely important step in predictive analytics; in this case, it targets the evasion of highly qualified professionals and possible dissatisfaction overlooked by management.

With VORTX’s ability to discover the different scenarios, we were able to analyze the data and conclude that:

  • People in groups 1 and 2 (55% of the company) have reasonable work satisfaction with a weekly load of 50 hours on average, without receiving a promotion or suffering an accident at work.
  • The pattern persists in all departments.
  • The most satisfied of the 20 largest groups were 7 and 10, who worked more than 247 hours a month and took on several projects, but as they did not receive a promotion they left the company. These people should be retained, since they seem to be highly qualified.
  • Group 16 proves that it is possible to earn a good salary and be dissatisfied. These 77 people should be interviewed to identify the root cause of such dissatisfaction.
  • The cut-off line for employees who left the company is a minimum of 170 and a maximum of 238 hours worked per month. People with more than 3.5 years at the company work harder and are more satisfied.
  • Monthly hours above 261 resulted in very low levels of satisfaction.
  • Monthly hours below 261 combined with more than 3 projects result in high job satisfaction.
  • Scenario 15 shows the importance of promotion over the last 5 years of work.
  • Those with more than 5 projects show decreased satisfaction; the ideal number is between 3 and 5. Of course, to understand this indicator better it is necessary to understand what the number of projects represents in different departments.

For managers, collecting as many indicators as possible, continuously and across all areas, is always good. More variables that would enrich the model include:

  • The distance between employee’s home and work.
  • The average time that is taken from home to work.
  • The number of children.
  • The number of phone calls or emails sent and received.
  • Gender and age and the reason for leaving the job.

We hope this information is useful for you in some way. If you find it relevant, share it with your colleagues. If in doubt, contact us! Best wishes and success in developing your own HR strategy!


How VORTX Big Data organises the world?

Hello everyone,

The objective of this post is to show you what happens when we give several numbers to a machine (VORTX Big Data) and it finds out by itself how the countries should be organized into different boxes. This technique is called clustering! The questions we will answer in this post are:

  • How are countries segmented based on the world’s indexes?
  • What are the characteristics of each group?
  • Which factors are the most influential for the separation?

Here we go!

Data First – What comes in?

I have gathered 65 indexes for 188 countries of the world; the sources are mainly:

  • UNDESA 2015,
  • UNESCO Institute for Statistics 2015,
  • United Nations Statistics Division 2015,
  • World Bank 2015,
  • IMF 2015.

Selected variables for the analysis were:

  1. Human Development Index HDI-2014
  2. Gini coefficient 2005-2013
  3. Adolescent birth rate 15-19 per 100k 2010-2015
  4. Birth registration under age 5 2005-2013
  5. Carbon dioxide emissions Average annual growth
  6. Carbon dioxide emissions per capita 2011 Tonnes
  7. Change forest percentile 1900 to 2012
  8. Change mobile usage 2009 2014
  9. Consumer price index 2013
  10. Domestic credit provided by financial sector 2013
  11. Domestic food price level 2009 2014 index
  12. Domestic food price level 2009-2014 volatility index
  13. Electrification rate or population
  14. Expected years of schooling – Years
  15. Exports and imports percentage GPD 2013
  16. Female Suicide Rate 100k people
  17. Foreign direct investment net inflows percentage GDP 2013
  18. Forest area percentage of total land area 2012
  19. Fossil fuels percentage of total 2012
  20. Freshwater withdrawals 2005
  21. Gender Inequality Index 2014
  22. General government final consumption expenditure – Annual growth 2005 2013
  23. General government final consumption expenditure – Percentage of GDP 2005-2013
  24. Gross domestic product GDP 2013
  25. Gross domestic product GDP per capita
  26. Gross fixed capital formation of GDP 2005-2013
  27. Gross national income GNI per capita – 2011  Dollars
  28. Homeless people due to natural disaster 2005 2014 per million people
  29. Homicide rate per 100k people 2008-2012
  30. Infant Mortality 2013 per thousands
  31. International inbound tourists thousands 2013
  32. International student mobility of total tertiary enrolment 2013
  33. Internet users percentage of population 2014
  34. Intimate or no intimate partner violence ever experienced 2001-2011
  35. Life expectancy at birth- years
  36. Male Suicide Rate 100k people
  37. Maternal mortality ratio deaths per 100 live births 2013
  38. Mean years of schooling – Years
  39. Mobile phone subscriptions per 100 people 2014
  40. Natural resource depletion
  41. Net migration rate per 1k people 2010-2015
  42. Physicians per 10k people
  43. Population affected by natural disasters average annual per million people 2005-2014
  44. Population living on degraded land Percentage 2010
  45. Population with at least some secondary education percent 2005-2013
  46. Pre-primary 2008-2014
  47. Primary-2008-2014
  48. Primary school dropout rate 2008-2014
  49. Prison population per 100k people
  50. Private capital flows percentage GDP 2013
  51. Public expenditure on education Percentage GDP
  52. Public health expenditure percentage of GDP 2013
  53. Pupil-teacher ratio primary school pupils per teacher 2008-2014
  54. Refugees by country of origin
  55. Remittances inflows GDP 2013
  56. Renewable sources percentage of total 2012
  57. Research and development expenditure 2005-2012
  58. Secondary 2008-2014
  59. Share of seats in parliament percentage held by woman 2014
  60. Stock of immigrants percentage of population 2013
  61. Taxes on income, profit and capital gain 2005-2013
  62. Tertiary -2008-2014
  63. Total tax revenue of GDP 2005-2013
  64. Tuberculosis rate per thousands 2012
  65. Under-five Mortality 2013 per thousands

What comes out?

Let’s start by looking at the map to see where these groups are; then we go to the VORTX visualization for a better understanding of the DNA (the composition of factors of each group).

Mundi

Click on the picture to play around with the map inside Google maps.

Ok, I see the clusters, but now I want to know what combination of characteristics unites or separates them. The picture below shows the VORTX visualization considering all groups and all factors.

Main groups

On the left side are the groups and their proportions. Segmentation sharpness is the measurement of the differences between groups based on all factors. On the right side is the total composition of variables, or what we can call the world’s DNA.
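For readers who want to reproduce this kind of grouping: VORTX is proprietary, but a generic sketch of the same pipeline, standardising the indicators and then clustering, can be written with scikit-learn's KMeans as a stand-in. The toy indicator values below are invented, not the real 65-index dataset:

```python
# Generic sketch of the indicator-clustering pipeline (KMeans stand-in).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# rows = countries, columns = [HDI, life expectancy, GDP per capita (k$)]
X = np.array([
    [0.92, 82.0, 45.0],
    [0.90, 81.0, 42.0],
    [0.55, 62.0,  3.0],
    [0.52, 60.0,  2.5],
    [0.74, 72.0, 12.0],
])

# standardise first, so GDP's large scale doesn't dominate the distances
X_std = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)
print(labels)
```

The standardisation step is what lets indicators measured in fractions (HDI), years and thousands of dollars contribute comparably to the grouping.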

In the next figures, you will see how different it becomes when we select each group.

Cluster 1

The most typical situation of a country, representing 51.60% of the globe. We call them average countries.

Cluster 2

The second most common type representing 26.46% of the globe.

Cluster 3

This is the cluster of the so-called first-world countries, whose results are above average, representing 14.89% of the globe. The United States does not belong to this group, but Canada, Australia, New Zealand and Israel do.

Cluster 4 - USA

The US is numerically so different from the rest of the world that VORTX decided to place it alone in a group of its own, which had the highest distinctiveness: 38.93%.

United Arab Emirates

Other countries had no similar countries to share a group with; this is the case of the United Arab Emirates.

Before we finish, below are the top 5 most and the 5 least influential factors that VORTX identified as key to creating the groups.

Top 5

  1. Maternal mortality ratio deaths per 100 live births 2013 – 91% influence
  2. Under-five Mortality 2013 thousand – 90%
  3. Human Development Index HDI-2014  – 90%
  4. Infant Mortality 2013 per thousands – 90%
  5. Life expectancy at birth- years – 90%

Bottom 5

  1. Renewable sources percentage of total 2012 – 70% influence
  2. Total tax revenue of GDP 2005-2013 – 72%
  3. Public health expenditure percentage of GDP 2013 – 73%
  4. General government final consumption expenditure – Percentage of GDP 2005-2013 – 73%
  5. General government final consumption expenditure – Annual growth 2005 2013 – 75%

Conclusions

According to VORTX, if you plan to live in another country or sell your product abroad, it would be wise to see which group that country belongs to. If it belongs to the same group as the one you live in, then you know what to expect.

Could other factors be added to or removed from the analysis? Yes, absolutely. However, sometimes it is not that easy to get the information you need at the time you need it. Big Data analyses usually have several constraints and typically rely on the type of questions posed to the data and to the algorithm, which, in turn, relies on the creativity of the Data Scientist.

The clustering approach is becoming more and more common in industry due to its strategic role in organizing and simplifying decision-making chaos. After all, how could a manager look at 12,220 cells to define a regional strategy?

Any question or doubts? Or anything that calls your attention? Please leave a comment!

For those who wish to see the platform operating in practice, here is a video using data from Switzerland. Enjoy it!


How can Big Data clustering strategy help business?

Hello folks,

To clarify the concept of clustering, a recurring theme in the machine learning area, we made a video tutorial that demonstrates a clustering problem we can solve visually; we then finish with a real case and some conclusions. It is important to mention that many areas can benefit from this technique, for example by segmenting markets so you can serve different audiences according to their characteristics. We will use a video example.

Below is the description of the video for those who like reading.

To facilitate the absorption of the concept, we will use a visual example. So, imagine that you have a textile factory and you want to produce as many flags as possible in the shortest time and with as few materials as possible. Considering that there are around 200 national flags and each has different colors and shapes, we want to know which color and shape patterns exist, in order to optimize and organize the production line. That’s the idea: reduce costs and time while maintaining quality and volume.

All flags

Figure 1 – Representation of raw data without patterns detected

A good clustering algorithm should be able to identify patterns in raw data the way we humans can visually, by looking at the Italian, Irish and Mexican flags in the example below. One factor that differentiates clustering algorithms from classification algorithms is that they get no hints about the patterns: they must figure the model out automatically, and this is a big challenge for practitioners.

bandeiras1

Figure 2: Cluster zero (0) composed of the Italian, Irish and the Mexican flags.

In this context, it is as important to identify groups with similarities between each other as it is to find individuals who do not resemble any other element: the so-called outliers, which are the exceptions.

bandeiras2

Figure 3: Cluster six (6) composed of the flag of Nepal. An exception.

Finally, as the result of a good clustering process, we have groups formed by flags that share similar features, and isolated individuals: the outliers.

bandeiras3

Figure 4: Clusters formed at the end of visual human-based processing.

One of the most important factors in clustering is the number of groups into which the elements will be allocated. In many cases, we have observed very different results when applying the same data and the same parameterization to different algorithms. This is very important. Below is what the result of an inaccurate clustering could look like.

bandeiras4

Figure 5: Clusters resulting from an incorrect clusterization
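The sensitivity to the number of groups can be demonstrated with a toy sketch: the same data and the same algorithm, but a different cluster count, yields a very different grouping. Here each "flag" is reduced to an invented two-number colour profile, and scikit-learn's KMeans stands in for the clustering step:

```python
# Toy sketch: the choice of cluster count changes the grouping entirely.
import numpy as np
from sklearn.cluster import KMeans

# invented [share of green, share of red] profiles per flag
flags = np.array([
    [0.33, 0.33],   # tricolour-like flag
    [0.33, 0.30],   # tricolour-like flag
    [0.32, 0.34],   # tricolour-like flag
    [0.00, 0.95],   # mostly-red flag
    [0.00, 0.90],   # mostly-red flag
])

labels2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(flags)
labels4 = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(flags)
print(labels2.tolist())  # tricolours together, reds together
print(labels4.tolist())  # the natural groups get split apart
```

With two clusters the natural grouping appears; forcing four clusters splits it artificially, which is the kind of "wrong clusterization" the figure above illustrates.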

So, a practical question is:

Would you invest your money in this?

Probably not, and solving this problem is our challenge. A real application we carried out was identifying the main characteristics of patients who don’t show up to their medical appointments, the well-known no-show problem, which has deep implications for offices, clinics, and hospitals. The result was a striking group containing 50% of the analyzed data, which really deserves a specific policy. Doesn’t that catch the attention of the chief financial officers of these organizations?

Other possible applications of the clustering strategy were presented in this post “14 sectors for application of Big Data and data necessary for analysis.”

Some conclusions

  • Human vision is very powerful at clustering images, as in the case of the flags.
  • It is humanly impossible to perform analysis and logical correlation of numbers in a large database; that is why clustering algorithms were created.
  • The accuracy of the results of clustering is crucial for making investment decisions.
  • Several sectors can benefit from this management approach.

Thank you!
