What are outliers? They are data records that differ dramatically from all others, distinguishing themselves in one or more characteristics. In other words, an outlier is a value that escapes normality and can (and probably will) cause anomalies in the results obtained through algorithms and analytical systems. For that reason, outliers always demand some degree of attention.
Understanding outliers is critical when analyzing data, for at least two reasons:
The outliers may negatively bias the entire result of an analysis;
the behavior of outliers may be precisely what is being sought.
While working with outliers, many words can describe them depending on the context. Some other names are: aberration, oddity, deviation, anomaly, eccentricity, nonconformity, exception, irregularity, dissent, original and so on. Below are some common situations in which outliers arise in data analysis, along with suggested approaches for dealing with them in each case.
How to identify which records are outliers?
Find the outliers using tables
The simplest way to find outliers in your data is to look directly at the data table or worksheet – the dataset, as data scientists call it. The following table clearly exemplifies a typing error, that is, a data input error. The age field for the individual Antony Smith certainly does not hold a real value: nobody is 470 years old. Looking at the table it is possible to identify the outlier, but it is difficult to say which age would be correct. There are several possibilities, such as 47, 70 or even 40 years.
In a small sample, the task of finding outliers with tables can be easy. But when the number of observations runs into the thousands or millions, it becomes impossible. The task becomes even harder when many variables (the worksheet columns) are involved. For that, there are other methods.
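A check like the one above can be automated even for very large tables. The sketch below flags implausible ages in a hypothetical set of records; the names, values and the 0–120 plausibility range are illustrative assumptions, not data from the original study:

```python
# Hypothetical patient records; age 470 reproduces the typing error above.
records = [
    {"name": "Antony Smith", "age": 470},
    {"name": "Maria Silva", "age": 34},
    {"name": "John Doe", "age": 51},
]

def implausible_ages(rows, low=0, high=120):
    """Return every record whose age falls outside a plausible human range."""
    return [r for r in rows if not (low <= r["age"] <= high)]

outliers = implausible_ages(records)
print(outliers)  # [{'name': 'Antony Smith', 'age': 470}]
```

The same range check scales to millions of rows, although it only catches values that are impossible, not values that are merely unusual.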
Find outliers using graphs
One of the best ways to identify outliers is by using charts. When plotting a chart, the analyst can clearly see that something different exists. Below are some examples that illustrate how outliers show up in graphics.
Case: outliers in the Brazilian health system
In a study already published on Aquarela's website, we analyzed the factors that lead people to miss medical appointments scheduled in the public health system of the city of Vitória, in the state of Espírito Santo, no-shows that cause an approximate loss of 8 million US dollars a year.
In the dataset, several patterns were found, for example: children practically never miss their appointments, and women attend consultations much more than men. However, a curious case was that of an outlier who, at age 79, scheduled a consultation 365 days in advance and actually showed up for her appointment.
This is an outlier that deserves to be studied, because this lady's behavior can bring relevant information about measures that could be adopted to increase the attendance rate. See the case in the chart below.
Case: outliers in the Brazilian financial market
On May 17, 2017, Petrobras shares fell 15.8% and the stock market index (IBOVESPA) fell 8.8% in a single day. Most shares on the Brazilian stock exchange saw their price plummet that day. This strong negative variation was mainly motivated by the Joesley Batista case, one of the most shocking political events of the first half of 2017.
This case represents an outlier for the analyst who, for example, wants to know the average daily return of Petrobras shares over the last 180 days. The Joesley case certainly dragged the average down. Analyzing the chart below, even with many observations, it is easy to identify the point that disagrees with the others.
The data point in the example above may be called an outlier but, taken literally, it is not necessarily “outside the curve.” The “curve” in the graph, although counter-intuitive, is represented by the straight line that cuts through the points. The graph shows that, although different from the others, the point is not exactly outside the curve.
A predictive model could easily infer, with high precision, that a 9% drop in the stock market index would correspond to a 15% drop in Petrobras' share price. In another case, still with data from the Brazilian stock market, the stock of the company Magazine Luiza appreciated 30.8% on a day when the stock market index rose only 0.7%. This point, besides being atypical and distant from the others, does represent an outlier. See the chart:
This is an outlier that can harm not only descriptive statistics, such as the mean and the median, but also the calibration of predictive models.
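The pull a single extreme day exerts on an average is easy to demonstrate. The sketch below uses hypothetical daily returns (179 placeholder days of +0.1% plus one -15.8% day), not real market data:

```python
from statistics import mean

# Hypothetical daily returns (%): 179 ordinary days plus one -15.8% outlier day.
ordinary_days = [0.1] * 179        # placeholder for typical small daily returns
returns = ordinary_days + [-15.8]  # append the crash day

mean_with_outlier = mean(returns)
mean_without_outlier = mean(ordinary_days)

# One day out of 180 drags the average return down by almost an order of magnitude.
print(round(mean_with_outlier, 3), round(mean_without_outlier, 3))
```

In this toy series, a single extreme observation pulls the 180-day mean from 0.1% down to roughly 0.012%, which is exactly the kind of distortion the analyst must watch for.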
Find outliers using statistical methods
A more complex but quite precise way of finding outliers in a data analysis is to find the statistical distribution that most closely approximates the distribution of the data and use statistical methods to detect discrepant points. The following example shows the histogram of the well-known fuel-efficiency metric “kilometers per liter”.
The dataset used in this example is a public dataset widely used in statistical tests by data scientists. It was extracted from the 1974 Motor Trend US magazine and comprises several aspects of the performance of 32 car models. More details at this link.
The histogram is one of the main and simplest graphing tools for the data analyst to use in understanding the behavior of the data.
In the histogram below, the blue line represents what the normal (Gaussian) distribution would be based on the mean, standard deviation and sample size, and is contrasted with the histogram in bars.
The red vertical lines represent units of standard deviation. It can be seen that cars with outlier performance for the era could average more than 14 kilometers per liter, which corresponds to more than 2 standard deviations from the mean.
Under a normal distribution, the data within two standard deviations of the mean corresponds to roughly 95% of the sample; in this analysis, the outliers represent the remaining 5%.
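The two-standard-deviation rule translates directly into code. The sketch below flags any reading more than 2 standard deviations from the mean; the fuel-efficiency values are hypothetical, not the actual figures from the 1974 dataset:

```python
from statistics import mean, stdev

# Hypothetical fuel-efficiency readings in km per liter (illustrative values only).
kml = [8.5, 9.1, 9.8, 10.2, 10.5, 10.9, 11.3, 11.8, 12.1, 12.6, 13.0, 15.2]

mu, sigma = mean(kml), stdev(kml)

# Flag points lying more than 2 standard deviations away from the mean.
outliers = [x for x in kml if abs(x - mu) > 2 * sigma]
print(outliers)  # [15.2]
```

Here the 15.2 km/l car is the only point beyond the 2-sigma band, mirroring the histogram analysis above. In practice the threshold (2 or 3 sigma) depends on how conservative the analyst wants to be.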
Outliers in clustering
In this video in English (with subtitles) we present the identification of outliers visually, using a clustering process with national flags.
Conclusions: What to do with outliers?
We have seen that it is imperative to pay attention to outliers because they can bias data analysis. Beyond identifying them, we suggest some ways to treat them:
Exclude the discrepant observations from the data sample: when the discrepant data is the result of a data input error, it needs to be removed from the sample;
perform a separate analysis with only the outliers: this approach is useful when you want to investigate extreme cases, such as students who only get good grades, companies that make a profit even in times of crisis, fraud cases, among others;
use clustering methods to find an approximation that corrects and assigns a new value to the outlier data: in cases of data input errors, instead of deleting and losing an entire row of records because of a single outlier observation, one solution is to use clustering algorithms that find the behavior of the observations closest to the outlier and infer which value would be the best approximation.
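A minimal sketch of this neighbor-based correction, assuming a hypothetical table where age was mistyped and the remaining columns (height, weight) are used to find the most similar records:

```python
from math import dist

# Hypothetical records as (height_cm, weight_kg, age). Age 470 is the input error.
rows = [
    (170, 70, 47), (168, 72, 45), (172, 69, 50),
    (171, 71, 470),                  # the outlier to correct
    (180, 90, 30), (160, 55, 25),
]

def impute_age(rows, idx, k=3):
    """Estimate an age for rows[idx] from the k rows closest in the other features."""
    target = rows[idx][:2]
    others = [r for i, r in enumerate(rows) if i != idx]
    nearest = sorted(others, key=lambda r: dist(r[:2], target))[:k]
    return sum(r[2] for r in nearest) / k

print(impute_age(rows, 3))  # average age of the 3 most similar people
```

This is a simplification: production systems would standardize the features, choose k carefully and often use a proper clustering model, but the principle of inferring a replacement value from the nearest observations is the same.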
Finally, the main conclusion about outliers can be summarized as follows:
“an outlier may be what most disturbs your analysis, but it may also be exactly what you are looking for.”
Who is Aquarela Advanced Analytics?
Aquarela Analytics is a pioneer and national reference in the application of Artificial Intelligence in industry. With the Vortx platform and DCIM methodology, it serves important Brazilian customers such as Embraer, Randon Group, Solar Br Coca-cola and others.
Stay tuned for new Aquarela Analytics publications on Linkedin!
Founder of Aquarela and Director of Digital Expansion, Master in Business Information Technology at University of Twente – The Netherlands. Professor and lecturer in the area of Data Science, specialist in intelligence systems architecture and new business development for industry.
The way financial analysis is conducted is changing fast. In the last two decades, companies have generally undergone an intense process of computerization, started in the field of accounting, with the use of management systems such as ERPs and CRMs. Today they produce much more data than ever before, and this asset needs to be analyzed both from the finance and investment standpoint and from the Data Analytics and Advanced Analytics standpoint.
In this article we briefly compare the main changes occurring in the way financial analysts work, with regard to the future of the profession in relation to Advanced Analytics.
Key motivators of changes in how to do financial analysis
The connectivity of recent years has generated business models that could never have been imagined before, able to serve varied audiences 24 hours a day with a scalability unprecedented in history, such as Uber. In addition, the volume of data has grown in size and complexity, creating a potential for insights and business transformations that escape purely financial analysis done only by conventional methods.
Traditional Financial Analysis and Advanced Analytics Techniques
Financial analysis has well-established and widespread methods. Saying whether or not a company is interesting for business or investment is a task often accomplished through the analysis of past accounting and financial indicators. To do this, each analyst has specific criteria to evaluate the economic and financial feasibility of new investments, serving both corporate finance and personal finance.
Data Analytics methods, in turn, can be used to automate and optimize financial decisions, according to methods already used in the area. It is also possible, with Advanced Analytics techniques, to incorporate machine learning and artificial intelligence algorithms to develop predictions in an innovative and still unexplored way in the market: this is what will generate a great competitive advantage for companies in the financial market and in Industry 4.0.
Data Analytics and Advanced Analytics in practice:
Automate and optimize financial analytics with Data Analytics:
creation of datasets with automatic updating with macroeconomic data (basic interest rate, inflation, GDP, among others);
collecting data from financial statements in an automated way, either from publicly traded companies via APIs of existing databases, or from privately held companies through the extraction of tables from PDF files, for example;
creation of automated descriptive reports on the main changes in the macroeconomic scenario.
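As an illustration of the first item, the sketch below joins two hypothetical macroeconomic series into a single row-per-date dataset; in a real pipeline the series would be refreshed automatically from an official data API, and the rates shown here are made up:

```python
from datetime import date

# Hypothetical monthly macroeconomic series (values are illustrative only).
interest_rate = {date(2024, 1, 1): 11.25, date(2024, 2, 1): 11.25}
inflation = {date(2024, 1, 1): 0.42, date(2024, 2, 1): 0.83}

def build_dataset(*series, names):
    """Join several {date: value} series into one row-per-date dataset."""
    dates = sorted(set().union(*series))          # union of all dates present
    return [
        {"date": d, **{n: s.get(d) for n, s in zip(names, series)}}
        for d in dates
    ]

dataset = build_dataset(interest_rate, inflation,
                        names=["interest_rate", "inflation"])
print(dataset[0])
```

Once the series are fetched automatically, the same function rebuilds the dataset on every update, which is precisely the kind of repetitive work worth taking off the analyst's hands.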
Make predictions with Advanced Analytics – machine learning, artificial intelligence, among others:
models adaptable to changes in economic and financial reality, capable of making recommendations and indicating directions for decision making.
If financial analysts use spreadsheets such as Excel for their analyses, they can optimize the data extraction and cleaning processes with Data Analytics techniques and, in the end, still obtain an Excel output so they can continue performing the financial analyses they are already accustomed to. However, the great competitive advantage lies in the hands of analysts who can use Advanced Analytics to transform the way they perform financial analysis itself.
The area of finance is also strongly influenced by the use of econometric methods to make forecasts. However, conventional econometric models are usually static. Several robustness tests are usually made to validate such models, but the problem is that many of them do not adapt to changes in economic and financial reality, a typical situation given the dynamism of financial markets. This versatility and adaptability to change are characteristics of models that use machine learning and artificial intelligence techniques, in a coherent implementation of the Data Analytics culture among financial analysts.
The Data Analytics culture presents a different way of acquiring analytical knowledge from the traditional model. Acquiring Analytics knowledge is more decentralized, an effect of the internet and of the sharing of programming code in package form (an influence of computer science and versioning techniques). That is, instead of spending months or even years building all the calculations in isolation in an Excel spreadsheet to reach a conclusion, with the Data Analytics culture it is possible to import complete sets of code that perform complex analyses on the data in minutes, greatly speeding up the process.
To get an idea of the growth of this type of approach to problem solving, we present below the volume of packages added to the main repository of R language packages – CRAN.
The possibilities in this new mode are so broad that, in a few seconds, it is possible to install and execute commands for the automatic generation of Internet memes, like this one, with only 4 lines of code.
Comparison of traditional financial analysis, Data Analytics and Advanced Analytics
Typically, traditional financial analysis methods involve stable, well-established valuations without the need to present or discuss the methods used. In Analytics methods, by contrast, communities share code and tools, not just concepts. See the table below for a comparison of the approaches.
Replication level and analysis speed
Traditional: low, since each worksheet is self-contained and changes are not shared.
Data Analytics: intermediate, using good scripting practices and collaborative work, with spreadsheets in dataset format.
Advanced Analytics: high, using structured systems that operate in a distributed, multiplatform and scalable way.

Use of Artificial Intelligence
Traditional: trend analysis with strong use of time series and regression methods; in general, robust but static models.
Data Analytics: predictions with statistical weights for all analyzed variables, with a wide range of generic algorithms available for analysis.
Advanced Analytics: continuous improvement of the accuracy, speed and assertiveness of predictive models, with weights on all variables discovered by the algorithms themselves.

Data used
Traditional: internal financial health data of the organization, comparison with similar organizations, and macroeconomic analyses based on theoretical premises.
Data Analytics: internal data, data linked to macroeconomic aspects, text analysis (such as minutes and explanatory notes), and investigation of relationships also with non-financial data.
Advanced Analytics: internal and external data at various levels of granularity.

Tools
Traditional: statistical and econometric software, such as SPSS, Eviews and Stata; mostly closed-code tools.
Data Analytics: R, Python or other programming notebooks, data cleaning tools, data mining algorithm suites, pure text editors (example: Sublime) and Git for versioning code and creative artifacts; mostly open-code tools.
Advanced Analytics: machine learning and artificial intelligence platforms that contemplate the use of several algorithms, plus distributed computing platforms such as Spark and Hadoop; a mix of open- and closed-code tools.

Analyst main activities
Traditional: analysis of financial statements and indicators; development of economic and financial reports.
Data Analytics: definition of financial analysis structures, preparation of datasets and of the information flow of the indicators that compose them; not limited to financial indicators.
Advanced Analytics: deployment of large-scale models integrated with transactional systems.
Final considerations and recommendations
The deepest impact of shifting financial analysis to Analytics paradigms is on the nature of the work of financial analysts, which becomes oriented to package orchestration and data flows through scripts, with less technical dependence on the IT and development sectors.
For those who work in financial analysis and intend to adapt to new market trends, increasingly based on data, we recommend an in-depth study of the basic packages of programming languages (mainly R and Python), learning code versioning methods (such as Git and GitHub), and participating in Data Science best-practice communities in your region or online.
Industry 4.0 is characterized by a change in the flow of value: from centrally designed, resource-intensive products to knowledge-intensive, decentralized services, designed and produced with strong support from Advanced Analytics and AI throughout the digital transformation.
This process began with the Internet boom in the first decade of the millennium. 2018 seems to be the year of emancipation of Industry 4.0, which ceases to exist only in scientific articles and laboratories and evolves with vigorous support from the budgets of the largest corporations in the world, according to research by the OECD, Gartner Group and PwC.
From our point of view, Industry 4.0 materializes from the concepts of Web 3.0, whose core lies in the democratization of the capacity for action and knowledge (as already discussed in this blog post). But before we get to 4.0, let us put its previous versions in perspective:
Characterized by the discovery of the economic gains of producing in series rather than artisanally (individually), making it possible to mechanize labor that was previously performed only by people or animals. It was the moment when man began to harness the force of water, wind and also fire, through steam engines and mills. In 1776, Adam Smith (The Wealth of Nations) presented the advantages of segmenting work in a pin factory. (know more)
Key Components – Coal and Steam Engines.
Its major driver was electricity which, through generators, motors and artificial lighting, made it possible to establish assembly lines and thus the mass production of consumer goods.
Key Components – Electricity and Electromechanical Machines
Characterized by automation, its driving force is the use of robots and computers in the optimization of production lines.
Key Components: Computers and Robots
Industry 4.0 is characterized by the strong automation of the design, manufacturing and distribution stages of goods and services, with intensive use of CI – Collective Intelligence – and AI – Artificial Intelligence. In Industry 4.0, with the evolution of the Web, individuals are increasingly empowered by their agents (smartphones). Catering to the needs of this new consumer is one of the great challenges of the new industry.
To better illustrate this concept we created the following table:
Before industry age
Use of electric, thermic, hydraulic energy
Electric energy as a main driver, assembly line process start
People using machines (computers) as assistants
People and machines
Use of automation (robots and computers)
Collective intelligence + machines
Use of computational and collective intelligence to create products and services
In order to understand Industry 4.0 it is important to clarify some concepts that make up its foundations: AI – Artificial Intelligence and CI – Collective Intelligence.
Let's start with CI, which is more tangible, since we constantly use mechanisms that rely on collective intelligence for the production and curation of content, such as Wikipedia, Facebook, Waze and Youtube.
Wikipedia: almost all of Wikipedia's content is produced by hundreds of thousands of editors worldwide and curated by millions of users who validate and review it.
Waze: the Waze application uses the movement of its own users to build and refine its maps, providing real-time alternative routes to escape traffic congestion, as well as routes along new road sections created by cities.
Facebook and Youtube are services that today hold a diverse range of content, spontaneously generated and curated by their users through likes and shares.
What do these mechanisms have in common? They rely on the so-called intelligence of the masses, a concept established by the Marquis de Condorcet in 1785, which quantifies the degree of certainty of a decision made by a collective of individuals.
With hundreds or thousands of individuals acting in their own way, the sum of all these actions yields a whole that is greater than the sum of the parts. This collective behavior is observed in so-called swarm effects, in which insects, birds, fish and humans, acting collectively, achieve much greater feats than they would individually.
Condorcet proved this mathematically, inspiring Enlightenment leaders who used his ideas as a basis for the formation of democracies in the 18th and 19th centuries.
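Condorcet's result is easy to verify numerically. The sketch below computes the probability that a strict majority of n independent voters, each correct with probability p, reaches the right answer:

```python
from math import comb

def majority_correct(n, p):
    """Probability that a strict majority of n independent voters,
    each correct with probability p, reaches the right answer."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# With each voter only slightly better than chance (p = 0.6),
# a single voter is right 60% of the time, but a group of 101
# voters is right far more often.
print(round(majority_correct(1, 0.6), 3))    # 0.6
print(round(majority_correct(101, 0.6), 3))
```

This is the jury theorem in miniature: as long as each individual is better than a coin flip and votes independently, adding voters drives the collective accuracy toward certainty.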
In a contemporary way, we can look at a database as a large lake of individual experiences that form a collective. Big Data is responsible for collecting and organizing this data and Advanced Analytics for improving, creating and re-creating things (disruption) through intensive statistics and AI.
Under careful scrutiny, it is possible to understand AI as an artificial implementation of agents that use the same principles as CI – Collective Intelligence.
That is, instead of real ants or bees, artificial neurons and/or insects are used in a computational world (the cloud), which in some ways simulate real-world behavior and thus obtain decisions, responses and creations from the intelligence of the masses.
For instance, consider this piece used to support a bridge in the Dutch city of The Hague.
On the left is the original piece created by engineers. In the middle and on the right are two pieces created with an AI approach called a genetic algorithm. The right-hand piece is 50% smaller and uses 75% less material, and yet, because of its design, it is capable of sustaining the same dynamic load as its counterpart on the left.
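A genetic algorithm of the kind used to evolve such pieces can be sketched in a few lines. The toy objective below stands in for a real structural simulator, which is where the actual engineering complexity lives; the selection, crossover and mutation steps are the essence of the technique:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# Toy objective standing in for a structural simulator:
# find the x that maximizes f(x) = -(x - 3)^2 (the optimum is x = 3).
def fitness(x):
    return -(x - 3) ** 2

def evolve(generations=60, pop_size=20):
    pop = [random.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # selection: keep the fittest half
        children = [
            (random.choice(survivors) + random.choice(survivors)) / 2  # crossover
            + random.gauss(0, 0.1)                                     # mutation
            for _ in range(pop_size - len(survivors))
        ]
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(round(best, 2))
```

In generative design, the "fitness" is a physical simulation (load, weight, material use), and each candidate is a full geometry rather than a single number, but the evolutionary loop is the same.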
There are hundreds of AI use cases, ranging from smile detection in cameras and cell phones to cars that drive autonomously among human-driven cars in big cities.
Each AI use case relies on a set of techniques that can involve machine learning, insight discovery and optimal decision making through predictive and prescriptive Advanced Analytics and Creative Computing.
The intensive use of CI and AI can generate new products and services creating disruptions that we see today in some industries promoted by companies like Uber, Tesla, Netflix and Embraer.
In the case of Uber, CI is used heavily to generate competition and, at the same time, collaboration between drivers and passengers, complemented by AI algorithms that deliver a reliable transportation service at a cost never before available.
Despite being 100% digital, Uber is revolutionizing the way we are transported; it will very soon launch 100% autonomous taxis and, in the near future, drones that transport passengers through the skies. This is a clear example of digital transformation through redesign from the perspective of Industry 4.0.
Tesla uses CI from the captured data of the drivers of its electric cars and, applying Advanced Analytics, optimizes its own process and still uses them to train the AI that today is able to drive a car safely in the midst of the traffic of big cities of the world.
Tesla is a remarkable example of Industry 4.0. They use CI and AI to design their innovative products, a chain of automated factories to produce them and sell them online. And very soon they will transport and deliver their products to the buyer’s door with their new electric and autonomous trucks, completely closing the Industry 4.0 cycle.
Netflix, in turn, uses the viewing history and ratings given by its users to generate preference and recommendation lists that serve as input to the creation of originals such as the hits House of Cards and Stranger Things. In addition, Netflix uses bandit algorithms of its own to generate title covers and curate lists, attracting users (viewers) to consume new content.
Embraer, the world's third largest producer of civil aircraft and the largest innovation company in Brazil, uses AI, CI and Advanced Analytics in equipment maintenance systems.
By using these techniques, based on maintenance experiments and risk mitigation procedures fed to an AI, it is possible to reduce the costs of troubleshooting processes for high-value equipment, with savings of up to 18% in an industry where apparently low margins can generate considerable competitive impact.
Conclusions and Recommendations
The path to Industry 4.0 is paved with the techniques of CI, AI, Advanced Analytics, Big Data, Digital Transformation and Service Design, and with good examples from global leaders.
Transformation is often a process that can generate anxiety and discomfort, but it is necessary to achieve the virtues of Industry 4.0.
We suggest starting small and thinking big. Start by thinking about data: it is the building block of all Digital Transformation. Begin by fostering a Data Culture in your business, department or industry.
And how do you start thinking about data? Start with the definition of your data dictionaries; they will be your nautical charts in the middle of the Digital Transformation journey.
Understanding the potential of data and the new business it can generate is instrumental in the transition from producer of physical goods to provider of services, which may or may not be supported by physical products. See Uber and Airbnb: neither owns cars or real estate, yet they are responsible for a generous share of the transportation and accommodation markets.
We recommend raising the degree of maturity beginning with a diagnosis, then the elaboration of a plan of action and its application.
At Aquarela we have developed the Business Analytics Canvas Model, a Service Design tool for the development of new data-based business. It makes it possible to promote the intensive use of CI and AI in the Design and Services stages, the links that characterize the change from Industry 3.0 to 4.0.
We will soon publish more about Business Analytics Canvas Model and Service Design techniques for Advanced Analytics and AI.
Founder of Aquarela, CEO and architect of the VORTX platform. Master in Engineering and Knowledge Management, enthusiast of new technologies, with expertise in the Scala functional language and Machine Learning and AI algorithms.
Aquarela starts September engaged in the life valorization campaign, bringing to light a subject that has to be talked about. From schools all the way to the corporate world, mental suffering can be silently present in a colleague, neighbor or relative, and a refuge can make all the difference for them.
Suicide is a phenomenon present in all cultures since the beginning of human history. It relates to emotional, mental, social and economic aspects.
The person suffers from ambivalent feelings: they do not want to die, but they want to put an end to their psychic pain (or physical pain, in chronic cases). Since the subject is seen as a taboo, full of prejudice, it becomes stigmatized, which makes it harder to reach out for help or simply to have a conversation. The subject is simply avoided.
However, this year the ‘Blue Whale’ fever as well as the ‘13 Reasons Why’ TV series raised public interest in suicide. Some parents lost sleep, searched for information and sought help from health professionals. But the thought of suicide is not present only in the minds of the young; it is present in other age groups, including the elderly. And that is one more reason why suicide has to be discussed.
The good news is that suicide can be prevented, as long as it is treated as a public health issue, associated with information and prevention projects. Some relevant data follows.
World Health Organization Data
According to the Pan American Health Organization (PAHO/WHO):
over 800 000 people die every year from suicide;
suicide is the second leading cause of death among young people between the ages of 15 and 29;
only 60 of the 172 member nations provide data that is considered to have good quality;
it is estimated that 28 countries have national suicide prevention strategies;
in the Mental Health Action Plan 2013-2020, the WHO member states committed to reducing suicide rates by 10% by 2020;
around 75% of suicides happen in countries of medium and low income;
in wealthy countries, men die by suicide three times more often than women;
in high income countries the highest suicide rates are related to abuse of alcohol and depression;
90% of all suicides can be avoided;
in Brazil the average is 6 to 7 deaths per 100 000 inhabitants, which is considered low. However, that figure is not reliable, since the quality of data in our country has a lot of room for improvement.
“Every 40 seconds one person dies by suicide”
Artificial Intelligence and suicide
Artificial Intelligence (AI) can provide means for identifying patterns and suicidal behavioral tendencies, helping to refine preventive actions.
Recently, suicidal movements such as the previously mentioned ‘Blue Whale’ have gained visibility through their dissemination on social networks. There are also cases of people who express their feelings individually, also through social networks.
Considering that, the implementation of Artificial Intelligence algorithms and big data techniques can provide precise inferences about individuals who need help. Companies like Facebook, Instagram and Google have already announced that they will use AI on their platforms to provide warnings and prevention.
But much more can be done with the new technologies by bringing together technologists, teachers, professors, psychologists and other professionals. They can devise preventive measures, identify people at risk, and also provide protection through a support network.
An analysis from Aquarela
Based on the death records of 645 municipalities in the state of São Paulo, Joni Hoppen, one of Aquarela's founders, found that:
out of 300 000 deaths, 2 223 were suicides;
most of the deaths had unknown or uninformed professions; the exception was masons;
the lack of professional identification may be related to suicide, or health professionals and families may simply have great difficulty describing those people's jobs;
it was difficult to determine whether masons really committed suicide, or whether the deaths were work accidents reported as suicide due to labor issues;
a filter for “lawyers” returned 18 cases; the ratio of lawyers compared with other occupations such as janitors, shopkeepers and security guards indicates that people in favorable economic situations are also present in the statistics.
Humans construct their identity based on personal, social and professional relations. Jobs carry socio-historical meaning and define the role of an individual in society; this role affects how each person is seen by others and also how they evaluate themselves. When those visions become dysfunctional, health issues such as depression and suicidal thoughts can appear.
For people who are considering suicide not to be ashamed or afraid of reaching out for professional help, information and a welcoming environment are necessary.
It is necessary to be open to their pains and sufferings, without judgment or prejudice, showing interest and being available to them.
Discussing the issue helps the population as well as institutions to establish prevention strategies. One of the objectives of intervening is to recover self-esteem, promote emotional well-being and establish bonds of affection that can provide a support network for individuals.
In Brazil, we have the Centro de Valorização da Vida (CVV) (Life Valorization Center), an NGO that provides free, voluntary emotional support and suicide prevention services through chat, telephone, Skype and email, always preserving the individual's privacy.