Geographic Normalization: what is it and what are its implications?

There is great value in representing reality through visualizations, especially spatial information. If you have ever looked at a map, you know that the polygons that make up the political boundaries of cities and states are generally irregular (see Figure 1a). This irregularity makes analysis difficult and is poorly handled by traditional Business Intelligence tools.

Notice the green dot in Figure 1b: it sits over polygon (‘neighborhood’) n.14, located between n.16 and n.18. Now answer: which region has the greatest influence on the green dot, neighborhood n.16 or n.18? And is the green dot representative of region n.14, n.16 or n.18?

To answer questions like these and to minimize the bias generated by visualizations with irregular polygons, the Vortx Platform does what is known as Geographic Normalization, transforming irregular polygons into polygons of a single size and regular shape (see Figure 1c).

After geographic normalization, it is possible to analyze the data of a given space using absolute statistics, not only relative ones, and without the distortions caused by polygons of different sizes and shapes.

Figure 1 – Geographic normalization, map of Florianópolis. Source: Adapted from the Commercial and Industrial Association of Florianópolis – ACIF (2018)

Every day, people, companies and governments make countless decisions that involve geographic space. Which gym is closest to home? Where should we install the company’s new distribution center? Where should the municipality place its health centers?

So, in today’s article, we propose two questions:

  1. What happens when georeferenced information is distorted?
  2. How close to reality can our generalizations about space get?

Geographic normalization

Working with polygons and regions

Recall that the concept of a polygon comes from geometry, where it is defined as “a flat, closed figure formed by straight line segments”. When a polygon has all sides equal and, consequently, all angles equal, we call it a regular polygon. Otherwise, it is an irregular polygon.

We use the political division of a territory to understand its contrasts, usually delimiting nations, states and municipalities, but we can also delimit regions according to other characteristics, such as the Caatinga region, the Amazon Basin, the Eurozone or Trump and Biden voter zones. All that is needed is to enclose a certain part of space that shares some common characteristic. Regional polygons are therefore widely used to represent regions or the organization of a territory.

Several market tools fill polygons with different shades of color, according to each region’s data, looking for contrasts among them. But be careful: if the sizes and shapes of the polygons are not constant, geographic biases may arise, making the visualization susceptible to misinterpretation.

Thus, the polygon approach becomes limited in the following aspects:

  • Uneven comparisons between regions;
  • The need to relativize indicators by population, area or other factors;
  • It does not allow more granular analyses;
  • It demands more attention from analysts when making statements about certain regions.

Purpose of geographic normalization

The purpose of geographic normalization, therefore, is to overcome the typical problems of analyzing data attached to irregular polygons by transforming the organization of the territory into a set of polygons (in this case, hexagons) of regular size and shape.

In the example below, we compare the two approaches:

1) analysis with mesoregional polygons; and 2) hexagons over the southeastern region of Brazil.

Figure 2 – Geographic normalization. Source: Aquarela Advanced Analytics (2020)

Geographic Normalization seeks to minimize possible distortions of analysis generated by irregular polygons by replacing them with polygons of regular shape and size. This provides an elegant, eye-pleasing and precise alternative, capable of showing initially unknown patterns.

Normalization makes the definition of neighborhood between polygons clearer and simpler, and it also improves the fit of artificial intelligence algorithms that search for spatially autocorrelated patterns and events.

After all, according to the First Law of Geography:

“Everything is related to everything else, but near things are more related than distant things.”

Waldo Tobler

Geographic normalization can also be done in different ways, such as with equilateral triangles or squares. However, the hexagon introduces the least bias among these, since, for a given area, it has the shortest sides and perimeter, bringing it closest to a circle.

With normalization, it is possible to summarize the statistics of the points (inhabitants, homes, schools, health centers, supermarkets, industries, etc.) contained within these hexagons, so that the unit of analysis is constant in area and the summary statistics are meaningful. Mature analytics companies, with a robust and well-consolidated data lake, have an advantage in this type of approach. Also check out our article How to choose the best AI or data analytics provider?
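To make this concrete, below is a minimal Python sketch of how point records can be summarized into a regular hexagonal grid using axial hexagon coordinates on an already-projected plane. The coordinates, cell size and point list are illustrative assumptions; in practice, a dedicated spatial index such as Uber’s H3 is usually preferred.

```python
import math
from collections import defaultdict

def point_to_hex(x, y, size):
    """Map a projected point (x, y) to the axial coordinates of a pointy-top hexagon of circumradius `size`."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size   # fractional axial coordinates
    r = (2 / 3 * y) / size
    cx, cz = q, r                               # cube rounding to the nearest hexagon centre
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz

def hex_summary(points, size):
    """Count how many points (e.g. schools, homes) fall inside each hexagonal cell."""
    counts = defaultdict(int)
    for x, y in points:
        counts[point_to_hex(x, y, size)] += 1
    return dict(counts)

# Illustrative projected coordinates in metres, aggregated into 500 m hexagons
schools = [(100.0, 250.0), (120.0, 260.0), (900.0, 400.0)]
print(hex_summary(schools, size=500))
```

Because every cell has the same area, the resulting counts can be compared directly as absolute statistics, which is exactly the benefit described above.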

Usage of normalized geography

Normalized geography can also be used through interactive maps. Maps of this type allow a very interesting level of zoom in the analyses, as we can see in the animation below, which shows a Vortx Platform visualization of schools in the city of Curitiba, Brazil.

The darker the hexagon, the greater the number of schools. Note that we can also access other data through the pop-up and change the size of the hexagons as desired.

“The greater the amount of point data available in a region, the smaller the possible size of the hexagons”. 

Limitations of the normalized analysis

Like any representation of reality, models that use normalized analysis – although of great value in decision making – do not completely replace the representation of spatial data in irregular polygons, especially when:

  • There is a clear political division to be considered;
  • There is no reasonable amount of data;
  • There is no consensus on the size of regular polygons.

In addition, the computational process needed to produce normalized maps must also be taken into consideration, since the processing load depends not only on the number of observations of the analyzed phenomenon but also on the treatment of the geography under analysis. For example, conventional workstations can take hours to run basic geostatistical calculations for the 5,573 cities in Brazil.

Geographic Normalization – Conclusions and recommendations 

In this article we explained geographic normalization, its importance, its advantages and the precautions needed when conducting spatial analyses. In addition, we compared two important approaches to spatial data analysis. It is worth noting that these approaches are complementary and, together, give a better understanding of how data is distributed over space. Therefore, we recommend viewing the analyses from multiple facets.

We have seen that, when the geographic space is represented in an equitable way, a series of benefits to the analyses becomes feasible, such as:

  • Alignment of the size of the view cells with business needs;
  • Adaptation of the visualizations according to the availability of data;
  • “Fair” comparisons based on the absolute indicators of each region;
  • Observation of intensity areas with less bias;
  • Simplification of the definition of neighborhood between polygons, providing better adherence to spatial algorithms;
  • Finding patterns and events that autocorrelate in space with greater accuracy (see the spatial autocorrelation sketch after this list);
  • Use of artificial intelligence algorithms (supervised and unsupervised) to identify points of interest that would not be identified without normalization. More information at: Application of Artificial Intelligence in georeferenced analyses.
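As an illustration of the spatial autocorrelation point above, the sketch below computes Moran’s I, a classic autocorrelation statistic, from first principles with NumPy. The cell values and the adjacency matrix are toy assumptions; dedicated libraries such as PySAL provide the same measure along with significance tests.

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I spatial autocorrelation for one value per cell (e.g. one count per hexagon).

    values:  1-D array of observations.
    weights: n x n neighbourhood matrix (1 where two cells are adjacent, 0 otherwise, 0 on the diagonal).
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = x.size
    z = x - x.mean()
    numerator = (w * np.outer(z, z)).sum()
    denominator = (z ** 2).sum()
    return (n / w.sum()) * numerator / denominator

# Toy chain of four cells where similar values sit next to each other,
# so Moran's I comes out positive (clustering of similar values).
vals = [10, 9, 2, 1]
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
print(morans_i(vals, adj))
```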

Finally, every tool has a purpose, and georeferenced visualizations can lead to good or bad decisions.

Therefore, using the correct visualization, along with the right, well-implemented algorithms and an appropriate analytical process, can enhance critical decisions and lead to the competitive advantages that are so important in the face of current economic challenges.

What is Aquarela Advanced Analytics?

Aquarela Analytics is a Brazilian pioneer and a reference in the application of Artificial Intelligence in industry and large companies. With the Vortx platform and the DCIM methodology, it serves important global customers such as Embraer (aerospace), Randon Group (automotive), Solar Br Coca-Cola (food), Hospital das Clínicas (health), NTS-Brazil (oil and gas), Votorantim (energy), among others.

Stay tuned by following Aquarela on LinkedIn!


AI provider? How to choose the best AI and Data Analytics provider?

Choosing an artificial intelligence provider for analytics projects, dynamic pricing or demand forecasting is, without a doubt, a process that should be on the table of every manager in the industry. If you are considering speeding up this process, one way out is to hire companies specialized in the subject.

A successful analytics implementation is, to a large extent, the result of a well-balanced partnership between internal teams and the teams of an analytics service provider, so this is an important decision. Here we will cover some of the key concerns.

Assessing the AI provider based on competencies and scale

First, you must evaluate your options based on the skills of the analytics provider. Below we list some criteria:

  • Consistent working method in line with your organization’s needs and size.
  • Individual skills of team members and way of working.
  • Experience within your industry, as opposed to the standard market offerings.
  • Experience in the segment of your business.
  • Commercial maturity of solutions such as the analytics platform.
  • Market reference and ability to scale teams.
  • Ability to integrate external data to generate insights you can’t have internally.

Whether you develop an internal analytics team or hire externally, the fact is that you will probably spend a lot of money and time with your analytics and artificial intelligence provider (partner), so it is important that they bring the right skills to your department’s business or process.

Consider all the options in the analytics offering.

We have seen many organizations limit their options to Capgemini, EY, Deloitte, Accenture and other major consultancies, or simply develop internal analytics teams.

However, there are many other good options on the market, including Brazilian companies whose rapid growth is worth watching, mainly in the country’s main technology hubs, such as Florianópolis and Campinas.

Adjust expectations and avoid analytical frustrations

We have seen, on several occasions, frustrated attempts to build fully internal analytics teams, whether for configuring data lakes, data governance, machine learning or systems integration.

The scenario for AI adoption is similar, at least for now, to the time when companies developed their own internal ERPs in data-processing departments. Today, of the 4,000 largest technology accounts in Brazil, only 4.2% still maintain internal ERP development, predominantly banks and governments, which makes total sense from the point of view of strategy and core business.

We investigated these cases a little more and noticed that there are at least four factors behind the results:

  • A non-data-driven culture and vertical segmentation prevent the necessary flow (speed and quantity) of ideas and data that make analytics valuable.
  • A waterfall project-management style, applied as if the teams were creating physical artifacts or ERP systems, which is not suitable for analytics.
  • Difficulty in hiring professionals who know both analytics and the company’s business area, combined with the lack of onboarding programs suited to the challenges.
  • Unforeseen technical challenges happen very often, so resilient professionals used to this “cognitive capoeira” (as we call it here) are necessary. Real-life datasets are never ready, nor as well calibrated as the Titanic passengers dataset used in machine learning tutorials. They usually contain outliers (What are outliers?), are tied to complex business processes and are full of rules, as in the example of the dynamic pricing of London subway tickets (article in Portuguese).

While there is no single answer to how to deploy robust analytics and governance and artificial intelligence processes, remember that you are responsible for the relationship with these teams, and for the relationship between the production and analytics systems.

Understand the strengths of the analytics provider, but also recognize its weaknesses

It is difficult to find professionals in the market with depth and with both functional and technical qualities, especially if your business profile is industrial and involves knowledge of rare processes, for instance the physical-chemical process of making brake pads or other specific materials.

But, like any organization, these analytics providers can also have weaknesses, such as:

  • Lack of international readiness in the implementation of analytics (methodology, platform), to ensure that you have a solution implemented fast;
  • Lack of a migration strategy, data mapping and ontologies;
  • No guarantee of knowledge transfer and documentation;
  • Lack of practical experience in the industry;
  • Difficulty absorbing the client’s business context.

Therefore, knowing the provider’s methods and processes well is essential.
The pillars of a good Analytics and AI project are the methodology and its technological stack (What is a technological stack?). So seek to understand the background of the new provider and ask about its experience with customers of a similar size to yours.

Also, try to understand how this provider solved complex challenges in other businesses, even if these are not directly linked to your challenge.

Data Ethics

Ethics in the treatment of data is a must-have, so we cannot fail to highlight this compliance topic. Data has long been at the center of management’s attention, but new laws are now being created, such as the GDPR in Europe and the LGPD in Brazil.

Pay attention to how your data will be treated, transferred and stored by the provider, and check whether its name comes up clean in Google searches and even in the records of public organizations.

Good providers are those who, in addition to knowing the technology well, have guidelines for dealing with the information of your business, such as:

  • They have very clear and well-defined security processes;
  • They use end-to-end encryption;
  • They track their software updates;
  • They respect NDAs (non-disclosure agreements), which should not be treated as mere formalities when it comes to data;
  • Their communication channels are aligned and segmented by security level;
  • They are well regarded by the data analysis community.

Conclusions and recommendations

Choosing your Analytics provider is one of the biggest decisions you will make for your organization’s digital transformation.

Regardless of which provider you choose for your company, it is important to assemble an external analytics consulting team that makes sense for your organization and that has a successful, proven technological and business track record that supports your industry’s demands.



AI for demand forecasting in the food industry

The concept of an equilibrium point between supply and demand is used to explain various situations in our daily lives, from bread at the neighborhood bakery, sold at the equilibrium price that matches the quantities desired by buyers and sellers, to the trading of company securities on the stock market.

On the supply side, defining the right price to charge and, above all, the right quantity are common issues in the planning and execution of many companies’ strategies.

In this context, how are technological innovations in the data area establishing themselves in the food sector?

The construction of the demand forecast

The projection of demand is often built through historical sales data, growth prospects for the sector or even targets set to engage sales of a certain product.

When relying only on these means of forecasting, without considering the specific growth of each SKU (Stock Keeping Unit), companies can fall into the traps of subjectivity or generalism.

The expansion of a sector does not result in a growth of the same magnitude for the entire product mix. For example, does a projected annual growth of 6% for the food sector necessarily imply equivalent growth for the noble meat segment?

Possibly not, as this market niche may be more resilient or sensitive than the food sector, or it may even suffer from recent changes in consumer habits.

Impacts of Demand Forecasting Errors

For companies, mainly large ones with economies of scale and broad geographic reach, an error in the demand forecast can have several consequences, such as:

  • Stockouts;
  • Waste of perishables (What is FIFO?);
  • Drop in production;
  • Idle stock (slow-moving items);
  • Pricing errors.

Adversities like these directly impact companies’ bottom lines, as they result in loss of market share, higher costs or poor dilution of fixed costs, growing losses of perishable products, employee frustration with targets and, above all, a breakdown in the confidence of recurring customers who depend on the supply for their operations.

The demand forecast in the food sector

The food industry is situated in a context of highly perishable products with the following characteristics:

  • High inventory turnover;
  • Parallel supply in different locations;
  • Large number of SKUs, points of production and points of sale;
  • Verticalized supply chain;
  • Non-linearity in data patterns;
  • Seasonality.

These characteristics make the sector a business niche that is more sensitive to deviations in demand forecast and adjacent planning.

Supply chain opportunity

As an alternative to the traditional demand forecast format, there are opportunities to use market and AI data to assist managers in the S&OP (Sales & Operations Planning) process, as well as in the S&OE (Sales and Operations Execution) process.

During the S&OP process, demand forecasting supported by AI facilitates the work of the marketing and sales areas, as well as reducing uncertainty and increasing predictability for the supply chain areas.

In the S&OE process, AI can be used to identify new opportunities and to correct deviations from what was planned.

In addition to the technical attributes that AI can add to the process, the data-driven basis reduces points of conflict between teams, reduces historical disputes over SKU preferences and makes the process more transparent between areas.

Previously on our blog, we addressed the challenges of demand forecasting from our point of view (part 1, in Portuguese). In those articles, we discuss the differentials of the predictive approach to demand, taking into account factors such as seasonality, geographic/regional preferences and changes in consumer behavior.

We understand that the need for a predictive approach based on data, especially data external to the company, is increasingly pressing.

The role of machine learning in the food sector

The use of AI, through machine learning techniques associated with a coherent analytics technology stack (What is a technological stack?), provides greater information speed, data organization at different granularities (region, state, city and neighborhood), seasonality adjustments, exploration of opportunities and decision-making in real time.
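As a rough illustration of this kind of pipeline (and not of Aquarela’s actual module), the sketch below builds lag and calendar features from a synthetic weekly sales series and fits a gradient boosting model from scikit-learn. The series, the lag choices and the model are assumptions made only for demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic weekly sales history for one hypothetical SKU, with yearly seasonality plus noise
weeks = pd.date_range("2018-01-01", periods=156, freq="W")
week_of_year = weeks.isocalendar().week.to_numpy()
sales = pd.Series(
    200 + 30 * np.sin(2 * np.pi * week_of_year / 52)
    + np.random.default_rng(0).normal(0, 10, len(weeks)),
    index=weeks, name="units",
)

# Lagged demand and calendar features let the model capture trend and seasonality
df = pd.DataFrame({"units": sales})
df["lag_1"] = df["units"].shift(1)        # last week's sales
df["lag_52"] = df["units"].shift(52)      # same week last year
df["week"] = df.index.isocalendar().week.astype(int)
df = df.dropna()

features = ["lag_1", "lag_52", "week"]
train, test = df.iloc[:-12], df.iloc[-12:]            # hold out the last 12 weeks
model = GradientBoostingRegressor().fit(train[features], train["units"])
forecast = model.predict(test[features])
print(np.abs(forecast - test["units"]).mean())        # mean absolute error on the hold-out weeks
```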

In the case of the food sector, the greatest accuracy in forecasting demand means:

  • Inventory optimization among Distribution Centers (DCs);
  • Reduction of idle stocks;
  • Decrease in disruptions that cause loss of market share due to substitute products;
  • Direct reduction in losses with perishability (FIFO).

The great technical and conceptual challenge faced by data scientists (The profile of data scientists in the view of Aquarela), however, is the modeling of the analysis datasets (what are datasets?) that will serve for the proper training of the machines.

Please note that:

“Performing machine training with data from the past alone will cause the machines to replicate the same mistakes and successes of the past, especially in terms of pricing, so the goal should be to create hybrid models that help AI replicate, with more intensity and emphasis, the desired behaviors of the management strategy.”

In the case of Aquarela Analytics, the demand forecast module of Aquarela Tactics makes it possible to obtain forecasts integrated into corporate systems and management strategies. It was created based on real nationwide retail data and on algorithms designed to meet specific demands in the areas of marketing, sales, supply chain, operations and planning (S&OP and S&OE).

Conclusions and recommendations

In this article, we present some key characteristics of the operation of demand forecasting in the food sector. We also comment, based on our experiences, on the role of structuring analytics and AI in forecasting demand. Both are prominent and challenging themes for managers, mathematicians and data scientists.

Technological innovations in forecasting, especially with the use of Artificial Intelligence algorithms, are increasingly present in the operation of companies and their benefits are increasingly evident in industry publications.

In addition to avoiding negative points of underestimating demand, the predictive approach, when done well, makes it possible to gain market share in current products and a great competitive advantage in forecasting opportunities in other niches before competitors.


What is a technological stack?

A stack represents the set of integrated systems needed to run a single application without additional software. Above all, one of the main goals of describing a technology stack is to improve communication about how an application is built. The chosen technology package may include:

  • the programming languages used;
  • the frameworks and tools a developer needs to interact with the application;
  • known performance attributes and limitations;
  • a survey of the strengths and weaknesses of the application in general.

As a rule, stacks must have a specific purpose. For instance, if we look at the web 3.0 stack (what is web 3.0?), we see how different it is from a data-analysis stack built around the R statistical language. That is, when constructing a stack, you should always ask: what is the underlying business purpose?
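One lightweight way to keep that purpose explicit is to document the stack as structured data that can be inspected and checked by the team. The sketch below is purely hypothetical: every layer and technology name is an illustrative assumption, not a recommended stack.

```python
# Hypothetical, purpose-driven description of an analytics stack (all names are illustrative)
analytics_stack = {
    "business_purpose": "weekly demand forecasting for the supply-chain area",
    "languages": ["Python", "SQL"],
    "data_layer": {"warehouse": "relational database", "raw_files": "object storage"},
    "processing": ["pandas", "scikit-learn"],
    "serving": {"api": "REST service", "dashboards": "BI tool"},
    "non_functional": {"security": "encryption at rest and in transit", "sla_hours": 24},
}

def missing_layers(stack, required=("business_purpose", "data_layer", "processing", "serving")):
    """Simple governance check: which essential layers are still undocumented?"""
    return [layer for layer in required if layer not in stack]

print(missing_layers(analytics_stack))  # [] when every essential layer is described
```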

Where does this term come from?

The term comes from the software development community, where it is also quite common to speak of a full-stack developer.

A full-stack developer is, in turn, the professional who knows how to work in all layers of technologies of a 100% functional application.

Why is the technological stack so important?

Firstly, just as the accountant has all of the company’s transactions registered for financial management, developers and project leaders need the equivalent information about the development team’s work.

Secondly, developers cannot manage their work effectively without at least knowing what is happening and which technology assets are available (systems, databases, programming languages, communication protocols), and so on.

Mapping the technological stack is just as important as taking inventory in a company that sells physical products. It is in the technological stack that both the business strategy and the main learning (maturity) from the system tests the company has been through are concentrated.

The technological stack is the developers’ working dictionary, in the same way that data analysts look at their data dictionaries to understand the meaning of variables and columns. It is an important maturity item in the governance of organizations.

Without prior knowledge of the technological stack, management is unable to plan hiring, risk mitigation, increases in service capacity and, of course, the strategy for using data in the business area.

Technology stacks are particularly useful for hiring developers, analysts and data scientists.

“Companies that try to recruit developers often include their technology stack in their job descriptions.”

For this reason, professionals interested in advancing their careers should pay attention to the strategy of personal development of their skills in a way that is in line with market demand.

Technological stack example

Take the professional social network LinkedIn, for example: it depends on a combination of frameworks, programming languages and artificial intelligence algorithms to stay online. Here are some examples of technologies used in its stack:

Technological Stack – Linkedin for 300 million hits – Author Philipp Weber (2015)

Is there a technological stack for analytics?

Yes. The areas of analytics, machine learning and artificial intelligence are known for the massive use of information-systems techniques and technologies. Likewise, analytical solutions require very specific stacks to meet the functional (what the system should do) and non-functional (how the system should do it: security, speed, etc.) business requirements of each application.

As with the foundation of a house, the order in which the stack is built is important and is directly linked to the maturity of the IT and analytics teams, so we recommend reading this article – The 3 pillars of the maturity of the analytics teams (in Portuguese).

In more than 10 years of research into different types of technologies, we went through several technological compositions until reaching the current configuration of the Aquarela Vortx platform. The main results of this stack for customers are:

  • reduction of technological risk (learning is already incorporated in the stack);
  • constant technological updates;
  • speed of deployment and systems integration (go-live);
  • maturity in the maintenance of the systems in production;
  • quality of the interfaces and flows in the production environment, since the stack makes the retention of the technicians’ knowledge more efficient.

Conclusions and recommendations

In conclusion, we presented our view of the technological stack concept and of its importance for analytical projects, which, in turn, impacts strategic planning. It is worth bearing in mind that technological stacks, just like businesses, are always evolving.

The success of defining successful stacks is directly linked to the maturity of the IT and analytics teams (The 3 pillars of the maturity of the analytics teams – In Portuguese).

Regardless of the sector, the decisions involved in shaping the technological stack are a factor of success or failure in IT and analytics projects, because they directly affect operations and business strategy.

Finally, we recommend reading this other article on mitigating technological risk with the support of specialized companies – (How to choose the best data analytics provider? in Portuguese).



What are outliers and how to treat them in Data Analytics?

What are outliers? They are data records that differ dramatically from all the others, distinguishing themselves in one or more characteristics. In other words, an outlier is a value that escapes normality and can (and probably will) cause anomalies in the results obtained by algorithms and analytical systems. Therefore, they always require some degree of attention.

Understanding outliers is critical when analyzing data, for at least two reasons:

  1. Outliers may negatively bias the entire result of an analysis;
  2. The behavior of the outliers may be precisely what is being sought.

When working with outliers, many words can describe them depending on the context: aberration, oddity, deviation, anomaly, eccentricity, nonconformity, exception, irregularity and so on. Below are some common situations in which outliers arise in data analysis, along with suggested approaches for dealing with them in each case.

How to identify which records are outliers?

Find the outliers using tables

The simplest way to find outliers in your data is to look directly at the data table or worksheet (the dataset, as data scientists call it). The following table clearly exemplifies a typing error, that is, a data-input error. The age field for the individual Antony Smith certainly does not represent an age of 470 years. Looking at the table it is possible to identify the outlier, but it is difficult to say what the correct age would be: there are several possibilities, such as 47, 70 or even 40 years.

Table – Antony Smith age outlier
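The eyeball check above can also be turned into a simple validity rule. The sketch below uses pandas with invented records that mimic the table; the 0 to 120 age range is an assumed plausibility bound.

```python
import pandas as pd

# Invented registry that mimics the table above, including the typing-error age
records = pd.DataFrame({"name": ["Antony Smith", "Maria Souza", "João Lima"],
                        "age": [470, 34, 58]})

# Flag records outside a plausible human age range
suspects = records[(records["age"] < 0) | (records["age"] > 120)]
print(suspects)  # flags the Antony Smith record
```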

In a small sample the task of finding outliers with the use of tables can be easy. But when the number of observations goes into the thousands or millions, it becomes impossible. This task becomes even more difficult when many variables (the worksheet columns) are involved. For this, there are other methods.

Find outliers using graphs

One of the best ways to identify outliers is by using charts. When plotting a chart, the analyst can clearly see that something different exists. Here are some examples that illustrate how outliers appear in graphics.

Case: outliers in the Brazilian health system

In a study already published on Aquarela’s website, we analyzed the factors that lead people to miss medical appointments scheduled in the public health system of the city of Vitória, in the state of Espírito Santo, no-shows that caused an approximate loss of 8 million US dollars a year.

In the dataset, several patterns were found, for example: children practically never miss their appointments, and women attend consultations much more than men. However, a curious case was that of an outlier: a 79-year-old woman who scheduled a consultation 365 days in advance and actually showed up for her appointment.

This is an example of an outlier that deserves to be studied, because this lady’s behavior can reveal measures that could be adopted to increase the attendance rate. See the case in the chart below.

Chart – sample of 8,000 appointments

Case: outliers in the Brazilian financial market

On May 17, 2017, Petrobras shares fell 15.8% and the stock market index (IBOVESPA) fell 8.8% in a single day. Most shares on the Brazilian stock exchange saw their prices plummet that day. This strong negative movement was driven mainly by the Joesley Batista affair, one of the most shocking political events of the first half of 2017.

This case represents an outlier for an analyst who, for example, wants to know the average daily return of Petrobras shares over the last 180 days. Certainly, the Joesley affair pushed that average sharply down. Looking at the chart below, even among many observations, it is easy to identify the point that disagrees with the others.

Chart – Petrobras, 2017

The data point in the example above may be called an outlier but, taken literally, it is not necessarily “outside the curve”: the “curve” in the graph, although counter-intuitive, is the straight line that cuts through the points, and the point, while different from the others, does not fall far from that line.

A predictive model could easily infer, with high precision, that a 9% drop in the stock market index would correspond to a 15% drop in Petrobras’ share price. In another case, still with data from the Brazilian stock market, the stock of the company Magazine Luiza appreciated 30.8% on a day when the stock market index rose by only 0.7%. This data point, besides being atypical and distant from the others, also represents an outlier. See the chart:

This is an outlier that can harm not only descriptive statistics, such as the mean and the median, but also the calibration of predictive models.
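One way to make this “distance from the curve” concrete is to fit the line and flag the points with unusually large residuals. The sketch below uses made-up daily returns that mimic the two cases above (an index-driven crash that stays near the line and an idiosyncratic jump that does not); the numbers and the two-standard-deviation cut-off are illustrative assumptions.

```python
import numpy as np

# Made-up daily returns (%) of the market index and of one stock
index_ret = np.array([0.5, -1.2, 0.8, -8.8, 0.7, 1.1, -0.3])
stock_ret = np.array([0.9, -2.0, 1.5, -15.8, 30.8, 2.0, -0.5])

# Fit the "curve" (here, a straight line) relating the stock to the index
slope, intercept = np.polyfit(index_ret, stock_ret, deg=1)
residuals = stock_ret - (slope * index_ret + intercept)

# Points farther than 2 standard deviations of the residuals from the line are flagged
threshold = 2 * residuals.std()
print(np.where(np.abs(residuals) > threshold)[0])  # flags the +30.8% day, not the -15.8% one
```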

Find outliers using statistical methods

A more complex but quite precise way of finding outliers in a data analysis is to find the statistical distribution that most closely approximates the distribution of the data and to use statistical methods to detect discrepant points. The following example represents the histogram of the known driver metric “kilometers per liter”.

The dataset used for this example is a public dataset widely used in statistical tests by data scientists. It comes from the 1974 Motor Trend US magazine and comprises several aspects of the performance of 32 car models. More details at this link.

The histogram is one of the main and simplest graphing tools for the data analyst to use in understanding the behavior of the data.

In the histogram below, the blue line represents what the normal (Gaussian) distribution would be based on the mean, standard deviation and sample size, and is contrasted with the histogram in bars.

The red vertical lines represent the units of standard deviation. It can be seen that cars with outlier performance for the season could average more than 14 kilometers per liter, which corresponds to more than 2 standard deviations from the average.

Under the normal distribution, data within two standard deviations of the mean corresponds to about 95% of all data; the outliers, in this analysis, represent the remaining 5%.
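In code, the same two-standard-deviation rule can be applied directly. The readings below are hypothetical kilometers-per-liter values (not the actual Motor Trend data), used only to show the mechanics.

```python
import numpy as np

# Hypothetical fuel-efficiency readings in km per litre
kml = np.array([8.9, 9.1, 9.4, 9.6, 9.8, 10.0, 10.2, 10.4,
                10.6, 10.8, 11.0, 11.2, 11.4, 11.6, 16.3])

mean, std = kml.mean(), kml.std(ddof=1)

# Under a normal distribution, about 95% of the data lies within 2 standard deviations of the mean
z_scores = (kml - mean) / std
outliers = kml[np.abs(z_scores) > 2]
print(outliers)  # [16.3], the only reading more than 2 standard deviations from the mean
```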

Outliers in clustering

In this video in English (with subtitles), we present the identification of outliers visually, using a clustering process with national flags.

Conclusions: What to do with outliers?

We have seen that it is imperative to pay attention to outliers because they can bias data analysis. Beyond identifying them, we suggest some ways to treat outliers better:

  • Exclude the discrepant observations from the data sample: when the discrepant data results from a data-input error, it should be removed from the sample;
  • Perform a separate analysis with only the outliers: this approach is useful when you want to investigate extreme cases, such as students who only get good grades, companies that profit even in times of crisis, fraud cases, among others;
  • In cases of data-input errors, instead of deleting and losing an entire row of records because of a single outlier observation, one solution is to use clustering algorithms that find the observations whose behavior is closest to the outlier and infer what the best approximate value would be (see the sketch below).
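As a minimal sketch of that last idea, the example below marks an implausible value as missing and lets scikit-learn’s KNNImputer suggest a replacement based on the most similar records. The records and the 120-year age bound are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical records: columns are [age, weekly_visits]; 470 is the input-error outlier
data = np.array([[34.0, 2.0],
                 [47.0, 3.0],
                 [52.0, 2.0],
                 [470.0, 3.0],   # implausible age
                 [61.0, 1.0]])

# Mark the implausible value as missing, then let the nearest neighbours fill it in
data[data[:, 0] > 120, 0] = np.nan
imputed = KNNImputer(n_neighbors=2).fit_transform(data)
print(imputed[3, 0])  # an age inferred from the most similar records, instead of 470
```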

Finally, the main conclusion about the outliers can be summarized as follows:

“An outlier may be what most disturbs your analysis, but it may also be exactly what you are looking for.”



14 sectors for applying Big Data and their input datasets

Hello folks, 

In the vast majority of conversations with clients and prospects about Big Data, we soon noticed an astonishing gap between the business itself and the expectations of Data Analytics projects. Therefore, we carried out research to answer the following questions:

  • What are the main business sectors that already use Big Data?
  • What are the most common Big Data results per sector?
  • What is the minimum dataset needed to reach these results per sector?

The summary is organized in the table below.

  1. Bank, credit and insurance – Raw data: transaction history; registration forms; external references such as the Credit Protection Service; micro and macroeconomic indices; geographic and demographic data. Opportunities: credit approval; interest-rate changes; market analysis; default prediction; fraud detection; identification of new niches; credit risk analysis.
  2. Security – Raw data: access history; registration forms; news texts and web content. Opportunities: detection of patterns of physical or digital behavior that pose any type of risk.
  3. Health – Raw data: medical records; geographic and demographic data; genome sequencing. Opportunities: predictive diagnosis (forecasting); analysis of genetic data; detection of diseases and treatments; health maps based on historical data; adverse effects of medications/treatments.
  4. Oil, gas and electricity – Raw data: distributed sensor data. Opportunities: optimization of production resources; prediction and detection of faults.
  5. Retail – Raw data: transaction history; registration forms; purchase paths in physical and/or virtual stores; geographic and demographic data; advertising data; customer complaints. Opportunities: increased sales through product-mix optimization based on purchase-behavior patterns; billing analysis (as-is, trends) across high volumes of customers and transactions; credit profiles by region; increased satisfaction/loyalty.
  6. Production – Raw data: production management / ERP system data; market data. Opportunities: optimization of production against sales; reduced storage time and volume; quality control.
  7. Representative organizations – Raw data: customers’ registration forms; event data; business-process management and CRM systems. Opportunities: suggestion of optimal combinations of company profiles, customers and business leverage for suppliers; identification of synergy opportunities.
  8. Marketing – Raw data: micro and macroeconomic indices; market research; geographic and demographic data; user-generated content; competitor data. Opportunities: market segmentation; optimized allocation of advertising resources; finding niche markets; brand/product performance; identification of trends.
  9. Education – Raw data: transcripts and attendance records; geographic and demographic data. Opportunities: personalization of education; predictive analytics for school dropout.
  10. Financial / economic – Raw data: lists of assets and their values; transaction history; micro and macroeconomic indices. Opportunities: identification of the optimal purchase value of complex assets with many analysis variables (vehicles, real estate, stocks, etc.); trends in asset values; discovery of opportunities.
  11. Logistics – Raw data: product data; routes and delivery points. Opportunities: optimization of goods flows; inventory optimization.
  12. E-commerce – Raw data: customer registration; transaction history; user-generated content. Opportunities: increased sales through automatic product recommendations; increased satisfaction/loyalty.
  13. Games, social networks and freemium platforms – Raw data: access history; user registration; geographic and demographic data. Opportunities: increased conversion rate from free to paying users by detecting user behavior and preferences.
  14. Recruitment – Raw data: registration of prospective employees; professional history and CVs; connections on social networks. Opportunities: evaluation of a person’s profile for a specific job role; criteria for hiring, promotion and dismissal; better allocation of human resources.

Conclusions

  • The table presents a summary for easy understanding of the subject. However, for each business there are many more variables, opportunities and, of course, risks. It is highly recommended to use multivariate analysis algorithms to help prioritize the data and reduce the project’s cost and complexity.
  • There are many more sectors in which excellent results have been derived from Big Data and data-science initiatives. However, we believe these can serve as examples for many other similar businesses willing to use Big Data.
  • Common to all sectors, Big Data projects need relevant and clear input data; therefore, it is important to have a good understanding of these datasets and of the business model itself. We have noticed that many businesses are not yet collecting the right data in their systems, which suggests the need for pre-Big Data projects. (We will write about this soon.)
  • One obstacle for Big Data projects is the great effort required to collect, organize and clean the input data. This can surely cause overall frustration among stakeholders.
  • At least as far as we are concerned, plug-and-play Big Data solutions that automatically fetch the data and deliver the analysis immediately still don’t exist. In 100% of cases, all team members (technical and business) need to cooperate: creating hypotheses, selecting data samples, calibrating parameters, validating results and then drawing conclusions. Thus, an advanced, science-based methodology must be used to take into account both the business and the technical aspects of the problem.

