The rise of the Self-taught Programmer

The desire to become a self-taught programmer or developer is at an all-time high right now, and the pandemic is partly responsible for the rapid growth of this path. During the pandemic, many physical jobs were lost, but the tech industry saw immense growth in revenue and job opportunities, and those opportunities began to attract unemployed people, or simply ordinary people looking to get a slice of the industry.

The tech industry offers some of the best working conditions and benefits around, the most famous being the ability to work from home, also known as the “home office”.

With all these shiny benefits, people started looking for easier ways to join the tech industry: ways that avoid the hassle of paying for a university or college and studying for years on end. The result was an explosion in the number of self-taught programmers.

What does it mean to be a self-taught programmer?

When you go to a university or college, you have a fixed curriculum, a ‘roadmap’ that shows you exactly what to study, in which order, and how to go about it. When you take the self-taught path, things are very different: you choose the roadmap yourself, perhaps with the help of friends or family, or a quick search on Reddit or YouTube. The whole idea is that you are in charge of putting together your own plan of action. It may not always be the best of plans, but when it succeeds you can gladly call yourself a “self-taught” programmer.

The challenges

Although it seems easy to many, being self-taught is arduous because you are constantly battling your doubts amid exhausting unpredictability and uncertainty. It takes time, patience, continuous learning, extensive research, building projects, and a lot of failing to become a self-taught programmer, but throughout that process you are building what is often called a “coding muscle”.

I remember back in early 2019 when I decided to embark on the programming journey, full of excitement and cheer, ready to change the world with code, but little did I know what was in store for me. The process was daunting, and I doubted myself almost every day during those early stages. I would find myself asking questions like: who am I to do this? I am over 30 already and have no college or university degree, so where exactly do I fit in this vast world of programming? Which programming language should I learn? Do I want to learn back-end or front-end? The list went on. If you are a self-taught programmer, some of those questions will probably sound familiar, because they mark stages most self-taught programmers go through.

Why you should hire a self-taught programmer

Well, self-taught programmers may not have the necessary diplomas or degrees in the programming field, but I can assure you that they can outwork, outthink and outmaneuver many varsity or college graduates.

  • They have vigor, passion, and a huge inner drive to achieve 

For starters, if you are teaching yourself to code, you either really love it or you really want it with your whole being, because it takes time, a huge amount of patience, dedication, a lot of guts, and an immense work ethic. Most self-taught programmers possess all of these traits and more.

  • They have support and know where to get information.

Although it might seem like a lonely journey, self-taught programmers often form part of a community where they share problem-solving skills and ideas with each other. This can be an advantage for the employer: you are not hiring just one programmer, because that programmer comes with a whole community of developers with expertise in different fields and technologies that they can always tap into.

  • Always ready to go

All new employees need to go through onboarding and training, a vital experience for the employee, but one that gets expensive the longer it drags on. Being self-taught more often than not means having a decent amount of real-world experience picked up along the learning journey, be it through collaborative projects or freelancing gigs. With that experience, the developer will most likely be ready to start coding sooner and with minimal training, often saving the company time and money.

  • When all else fails, they have plans C, D, E, and more if need be

Self-taught developers are skilled problem solvers; every great developer has an extensive history of solving problems. Universities give programmers a solid base in theory, but theory goes out the window when you encounter real-life coding problems and challenges.

A fundamental part of self-teaching is knowing how to untangle yourself when you are stuck in a situation, identifying problems, solving them, and learning from the process.


Conclusion

I hope this text doesn’t sound too one-sided in favor of the self-taught programmer as opposed to the traditional varsity- or college-educated programmer; take it with a grain of salt. Studies have shown that happy employees are up to 13% more productive (according to the University of Oxford), and self-taught developers are passionate about what they do, so there is no doubt that this is an advantage for the company. With all that said, I think we can all agree that the self-taught programmer is here to stay! 🎓


AI and Analytics strategic planning: concepts and impacts

The benefits and positive impacts of the use of data and, above all, artificial intelligence are already a reality in Brazilian industry. These benefits are most evident in areas ranging from dynamic pricing in education and forecasting missed medical appointments to predicting equipment breakdowns and monitoring the auto parts replacement market. However, to achieve these benefits, organizations need to reach a level of analytical maturity adequate for each challenge they face.

In this article, we discuss the concepts of AI and Analytics strategic planning and look at which characteristics of a scenario call for this type of project within companies’ Digital Transformation journey towards Industry 4.0.

What is AI and Analytics strategic planning?

AI and Data Analytics strategic planning is a structuring project that combines a set of consultative activities (preferably carried out by teams with an external view of the organization) to survey scenarios, map analytical processes, and catalogue digital assets (systems, databases, and others), in order to assess the different levels of analytical maturity of teams, departments, and the organization as a whole.

As a result, shared definitions of vision, mission, values, policies, strategies, action plans, and good data governance practices are established to raise the organization’s analytical maturity level in the shortest possible time and at the lowest possible cost.

Symptoms of low analytic maturity scenarios

Although there are many types of businesses, products, and services on the market, here we present recurring patterns that help characterize the problem of companies’ analytical maturity and can prompt interesting reflections:

  1. Is it currently possible to know which analytics initiatives (data analytics) have already taken place and are taking place? Who is responsible? And what were the results?
  2. In analytics initiatives, is it possible to know what data was used and even reproduce the same analysis?
  3. Does data analysis happen randomly, spontaneously, and in isolation within departments?
  4. Is it possible to view all data assets or datasets available to generate analytics?
  5. Are there situations in which the same indicator appears with different values depending on the department in which the analysis is carried out?
  6. Are there defined analytic data dictionaries?
  7. What is the analytical technology stack?
  8. Are data analytics structuring projects being considered in strategic planning?

Other common problems

Organizational identity

Scenarios with low analytical maturity do not have data quality problems in isolation. There are usually systemic problems involving the complexity of business processes, the level of training of teams, knowledge management processes, and, finally, the choice of technologies for operating ERP, CRM, and SCM systems and the way these transactional systems relate to each other.

Security Issues

Companies are living organisms that constantly evolve, with people working in different areas. Over time, control over each employee’s access levels is lost, so unauthorized people end up with access to sensitive information while, conversely, others cannot access the data they need for their work.

Excessive use of spreadsheets and duplicates

Spreadsheets are among the most useful and important management tools, which is why they support so many processes. The big side effect of excessive spreadsheet use is that the knowledge of each process lives inside them. When two or more people are involved and the volume of information and updates starts to grow, it becomes hard to manage knowledge that travels around in blocks of spreadsheets. In addition, duplicate records multiply, making it virtually impossible to consolidate data reliably at scale, as illustrated below.
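
To make the duplication problem concrete, here is a minimal pandas sketch (the spreadsheet extracts and column names are hypothetical, invented only for illustration) showing how naively combining two departmental spreadsheets double-counts records until they are deduplicated on a business key:

```python
import pandas as pd

# Hypothetical extracts of the same customer list kept by two different departments
sales = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "monthly_revenue": [5000.0, 7200.0, 3100.0],
})
finance = pd.DataFrame({
    "customer_id": [102, 103, 104],
    "monthly_revenue": [7200.0, 3100.0, 8900.0],
})

# Naive concatenation double-counts customers 102 and 103
combined = pd.concat([sales, finance], ignore_index=True)
print("Total before deduplication:", combined["monthly_revenue"].sum())  # 34500.0

# Deduplicating on the business key before consolidating fixes the total
consolidated = combined.drop_duplicates(subset="customer_id")
print("Total after deduplication:", consolidated["monthly_revenue"].sum())  # 24200.0
```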

What are the benefits of AI and Analytics strategic planning?

Data-driven management is expected to provide not just drawings and sketches of operations or market conditions, but a high-resolution photograph of present and future reality. It thus supports corporate strategic planning in the short, medium, and long term, with the following gains:

  • Procedural and technological readiness for data lakes projects and Advanced Analytics and AI labs.
  • Increased intensity of application of scientific techniques to businesses, such as comparative analysis, scenario simulations, identification of behavior patterns, demand forecasting, and others.
  • Increased accuracy of information.
  • Security of access to information at different levels.
  • Acceleration of onboarding processes (the entry of new team members), who in turn learn the work context more quickly and begin to communicate more efficiently.
  • Greater data enrichment from increased interaction of teams from different sectors around analytical challenges.
  • Increased visibility into analytics operations and organization of digital assets for findability, accessibility, interoperability, and reuse.
  • Optimized plan of change for data-driven Corporate Governance.
  • Incorporation of an analytical and AI mindset in different sectors.
  • Homogenization of data policies and controls.

AI and Analytics strategic planning – Conclusions and recommendations 

Preparing AI and Analytics strategic planning is an important step toward the level of data governance that allows the intensive use of analytics and artificial intelligence in operations, since the high failure rate of analytical projects is linked to low-quality data and processes, and even to the incorrect use of technologies (lack of training).

Structuring projects such as AI and Analytics strategic planning are, or at least should be, the first step in the digital transformation journey of traditional companies. We are convinced that, in the future, every successful company will have a clear and shared idea (vision, mission, and values) of what data means to it and to its business model, in contrast to investing in data technology purely and simply because of the competition.

We believe that the focus on orchestrated (tidy and synchronized) data will be reflected in almost every area, for example: in the range of services, revenue models, key resources, processes, and cost structures, in the corporate culture, in the focus on clients and networks, and in the corporate strategy.

Last but not least, it is worth pointing out that, for a successful structuring to happen, a long-term holistic approach must be taken. This means investments in optimized technology, people, and processes to enable continued business growth.

How Aquarela has been acting

We develop new technologies and new data-driven business models, in the belief that the amount and availability of data will continue to grow, taking business to new heights of optimization.

What we do specifically for companies:

  • We analyze data-generating enterprise ecosystems.
  • We determine analytic maturity and derive action fields for data-driven organizations and services.
  • We develop and evaluate data-based services.
  • We identify and estimate the data’s potential for future business models.
  • We design science-based digital transformation processes and guide their organizational integration.

For more information – Click here.


AI provider? How to choose the best AI and Data Analytics provider?

Choosing an artificial intelligence provider for analytics projects, dynamic pricing, or demand forecasting is, without a doubt, a decision that should be on the table of every manager in the industry. If you are considering speeding up the process, one way out is to hire companies specialized in the subject.

A successful analytics implementation is, to a large extent, the result of a well-balanced partnership between internal teams and the teams of an analytics service provider, so this is an important decision. Here, we will cover some of the key concerns.

Assessing the AI provider based on competencies and scale

First, evaluate your options based on the skills of the analytics provider. Below are some criteria:

  • Consistent working method in line with your organization’s needs and size.
  • Individual skills of team members and way of working.
  • Experience within your industry, as opposed to the standard market offerings.
  • Experience in the segment of your business.
  • Commercial maturity of solutions such as the analytics platform.
  • Market reference and ability to scale teams.
  • Ability to integrate external data to generate insights you can’t have internally.

Whether you develop an internal analytics team or hire externally, the fact is that you will probably spend a lot of money and time with your analytics and artificial intelligence provider (partner), so it is important that they bring the right skills to your department’s business or processes.

Consider all the options in the analytics offering.

We have seen many organizations limit their options to Capgemini, EY, Deloitte, Accenture, and other major consultancies, or simply develop internal analytics teams.

However, there are many other good options on the market, including Brazilian companies whose rapid growth is worth paying attention to, mainly within the country’s main technology hubs, such as Florianópolis and Campinas.

Adjust expectations and avoid analytical frustrations

We have seen, on several occasions, frustrated attempts to create fully internal analytics teams, whether for configuring data lakes, data governance, machine learning, or systems integration.

The scenario for AI adoption is similar, at least for now, to the era when companies developed their own internal ERPs in data processing departments. Today, of the 4,000 largest technology accounts in Brazil, only 4.2% still maintain internal ERP development, predominantly banks and governments, which makes total sense from the point of view of strategy and core business.

We investigated these cases a little more and noticed that there are at least four factors behind the results:

  • Non-data-driven culture and vertical segmentation prevent the necessary flow (speed and quantity) of ideas and data that make analytics valuable.
  • A waterfall project management style, applied as if the teams were creating physical artifacts or ERP systems; this style is not suitable for analytics.
  • Difficulty in hiring professionals with knowledge of analytics in the company’s business area together with the lack of on-boarding programs suited to the challenges.
  • Technical and unforeseen challenges happen very often, so resilient professionals used to this “cognitive capoeira” (as we call it here) are needed. Real-life datasets are never as ready and calibrated as those in textbook machine-learning examples such as the Titanic passengers dataset. They usually have outliers (What are outliers?), are tied to complex business processes, and are full of rules, as in the example of the dynamic pricing of London subway tickets (article in Portuguese).

While there is no single answer to how to deploy robust analytics, governance, and artificial intelligence processes, remember that you are responsible for the relationship with these teams, and for the relationship between the production and analytics systems.

Understand the strengths of your analytics provider, but also recognize their weaknesses

It is difficult to find professionals in the market with both functional and technical depth, especially if your business profile is industrial and involves knowledge of rare processes, for instance the physical-chemical process for creating brake pads or other specific materials.

But, like any organization, these analytics providers can also have weaknesses, such as:

  • Lack of international readiness in the implementation of analytics (methodology, platform), to ensure that you have a solution implemented fast.
  • Lack of migration strategy, data mapping, and ontologies.
  • No guarantee of transfer of knowledge and documentation.
  • Lack of practical experience in the industry.
  • Difficulty absorbing the client’s business context.

Therefore, knowing the provider’s methods and processes well is essential. The pillars of a good Analytics and AI project are the methodology and its technological stack (What is a technological stack?). So seek to understand the background of the new provider and ask about their experience with other customers of a similar size to yours.

Also, try to understand how this provider solved complex challenges in other businesses, even if these are not directly linked to your challenge.

Data Ethics

Ethics in the treatment of data is a must-have, so we cannot fail to highlight this compliance topic. Data has long been at the center of management’s attention, but new laws are now being created, such as the GDPR in Europe and the LGPD in Brazil.

Pay attention to how your data will be treated, transferred, and stored by the provider, and check whether its name comes up clean in Google searches and even in public organizations’ records.

Good providers are those who, in addition to knowing the technology well, have guidelines for dealing with your business’s information, such as:

  • They have very clear and well-defined security processes.
  • They use end-to-end encryption.
  • They track their software updates.
  • They respect NDAs (non-disclosure agreements); NDAs should not be treated as a mere formality when it comes to data.
  • Their communication channels are aligned and segmented by security level.
  • They are well regarded by the data analysis community.

Conclusions and recommendations

Choosing your Analytics provider is one of the biggest decisions you will make for your organization’s digital transformation.

Regardless of which provider you choose, it is important to assemble an external analytics consulting team that makes sense for your organization and that has a successful, proven technological and business track record supporting your industry’s demands.


AI for demand forecasting in the food industry

The concept of a balance point between supply and demand is used to explain various situations in our daily lives, from bread at the neighborhood bakery, which can be sold at the equilibrium price that equates the quantities desired by buyers and sellers, to the trading of company securities on the stock market.
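
As a toy illustration of that balance point (the curves and numbers below are invented for illustration only, not data from any real market), here is a short sketch that finds the price at which a linear demand curve meets a linear supply curve:

```python
# Hypothetical linear curves: demand falls as price rises, supply grows with price
def quantity_demanded(price):
    return 1000 - 40 * price  # units buyers want at a given price

def quantity_supplied(price):
    return 100 + 20 * price   # units sellers offer at a given price

# Equilibrium: 1000 - 40p = 100 + 20p  ->  60p = 900  ->  p = 15
equilibrium_price = 900 / 60
print(equilibrium_price)                     # 15.0
print(quantity_demanded(equilibrium_price))  # 400.0 units
print(quantity_supplied(equilibrium_price))  # 400.0 units -> the market clears
```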

On the supply side, defining the correct price to charge and, above all, the quantity to offer are common issues in the planning and execution of many companies’ strategies.

In this context, how are technological innovations in the data area establishing themselves in the food sector?

The construction of the demand forecast

The demand projection is often built from historical sales data, growth prospects for the sector, or even targets set to drive sales of a certain product.

When relying only on these forecasting methods, without considering the specific growth of each SKU (Stock Keeping Unit), companies can fall into the traps of subjectivity or over-generalization.

The expansion of a sector does not result in a growth of the same magnitude for the entire product mix. For example, does a projected annual growth of 6% for the food sector necessarily imply equivalent growth for the noble meat segment?

Possibly not, as this market niche may be more resilient or sensitive than the food sector, or it may even suffer from recent changes in consumer habits.

Impacts of Demand Forecasting Errors

For companies, especially large ones with economies of scale and widespread geographic operations, an error in the demand forecast can have several consequences, such as:

  • Stockouts;
  • Waste of perishables (What is FIFO?);
  • Drop in production;
  • Idle stock (slow-moving items);
  • Pricing errors.

Adversities like these directly impact companies’ bottom line, as they lead to loss of market share, higher costs or poor dilution of fixed costs, growing losses of perishable products, employee frustration with targets, and, above all, erosion of the trust of recurring customers who depend on supply for their operations.

The demand forecast in the food sector

The food industry is situated in a context of highly perishable products with the following characteristics:

  • High inventory turnover;
  • Parallel supply in different locations;
  • Large number of SKUs, points of production, and points of sale;
  • Verticalized supply chain;
  • Non-linearity in data patterns;
  • Seasonality.

These characteristics make the sector a business niche that is more sensitive to deviations in demand forecast and adjacent planning.

Supply chain opportunity

As an alternative to the traditional demand forecast format, there are opportunities to use market data and AI to assist managers in the S&OP (Sales & Operations Planning) process, as well as in the S&OE (Sales and Operations Execution) process.

During the S&OP process, demand forecasting supported by AI facilitates the work of the marketing and sales areas, as well as reducing uncertainty and increasing predictability for the supply chain areas.

In the S&OE process, AI can be used to identify new opportunities and to correct deviations from what was planned.

In addition to the technical attributes that AI adds to the process, grounding decisions in data reduces points of conflict between teams, reduces historical disputes over SKU preferences, and makes the process more transparent across areas.

Previously on our blog, we addressed the challenges of forecasting demand from our point of view (pt. 1, in Portuguese). In those articles, we describe the advantages of the predictive approach to demand, taking into account factors such as seasonality, geographic/regional preferences, and changes in consumer behavior.

We understand that the need for a predictive approach based on data, especially data external to the company, is increasingly pressing.

The role of machine learning in the food sector

The use of AI through machine learning techniques, combined with a coherent analytics technology stack (What is a technological stack?), provides faster information, data organization at different granularities (region, state, city, and neighborhood), seasonality adjustments, exploration of opportunities, and real-time decision-making.
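
As a rough illustration of the kind of model behind such forecasts (this is a generic scikit-learn sketch with made-up sales figures and column names, not Aquarela’s actual platform or algorithms), here is a per-SKU demand forecast that uses region and month as granularity and seasonality features:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical historical sales table; in practice this would come from the ERP or data lake
history = pd.DataFrame({
    "sku":        ["A", "A", "A", "B", "B", "B"] * 4,
    "region":     ["south", "south", "north", "south", "north", "north"] * 4,
    "month":      [1, 2, 3, 1, 2, 3] * 4,
    "units_sold": [120, 135, 150, 80, 95, 70] * 4,
})

# Encode the categorical granularities (SKU, region); the month carries seasonality
X = pd.get_dummies(history[["sku", "region", "month"]], columns=["sku", "region"])
y = history["units_sold"]

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Forecast next month's demand for SKU "A" in the south region (month 4)
next_period = pd.get_dummies(
    pd.DataFrame({"sku": ["A"], "region": ["south"], "month": [4]}),
    columns=["sku", "region"],
).reindex(columns=X.columns, fill_value=0)
print(model.predict(next_period))
```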

In the food sector, greater accuracy in demand forecasting means:

  • Inventory optimization among Distribution Centers (DCs);
  • Reduction of idle stocks;
  • Decrease in disruptions that cause loss of market share due to substitute products;
  • Direct reduction in losses with perishability (FIFO).

The great technical and conceptual challenge faced by data scientists (The profile of data scientists in the view of Aquarela), however, is the modeling of analysis datasets (what are datasets?) that will serve for the proper training of the machines.

Please note that:

“Performing machine training with data from the past alone will cause the machines to replicate the same mistakes and successes of the past, especially in terms of pricing. The goal, therefore, should be to create hybrid models that help the AI reproduce, with greater intensity and emphasis, the behaviors desired by the management strategy.”

In the case of Aquarela Analytics, the demand forecast module of Aquarela Tactics makes it possible to obtain forecasts integrated into corporate systems and management strategies. It was created based on real nationwide retail data and algorithms designed to meet specific demands in the areas of marketing, sales, supply chain, operations, and planning (S&OP and S&OE).

Conclusions and recommendations

In this article, we present some key characteristics of the operation of demand forecasting in the food sector. We also comment, based on our experiences, on the role of structuring analytics and AI in forecasting demand. Both are prominent and challenging themes for managers, mathematicians and data scientists.

Technological innovations in forecasting, especially with the use of Artificial Intelligence algorithms, are increasingly present in the operation of companies and their benefits are increasingly evident in industry publications.

In addition to avoiding the downsides of underestimating demand, the predictive approach, when done well, makes it possible to gain market share in current products and a great competitive advantage by spotting opportunities in other niches before competitors do.


What are outliers and how to treat them in Data Analytics?

What are outliers? They are data records that differ dramatically from all others; they distinguish themselves in one or more characteristics. In other words, an outlier is a value that escapes normality and can (and probably will) cause anomalies in the results obtained through algorithms and analytical systems. Therefore, they always demand some degree of attention.

Understanding outliers is critical when analyzing data, for at least two reasons:

  1. Outliers may negatively bias the entire result of an analysis;
  2. The behavior of outliers may be precisely what is being sought.

When working with outliers, many words can describe them depending on the context: aberration, oddity, deviation, anomaly, eccentricity, nonconformity, exception, irregularity, dissent, and so on. Below are some common situations in which outliers arise in data analysis, along with suggested approaches for dealing with them in each case.

How to identify which record is an outlier?

Find the outliers using tables

The simplest way to find outliers in your data is to look directly at the data table or worksheet – the dataset, as data scientists call it. The following table clearly exemplifies a typing error, that is, a data-input error. The age field for the individual Antony Smith certainly does not represent an age of 470 years. Looking at the table, it is possible to identify the outlier, but it is difficult to say what the correct age would be. There are several possibilities, such as 47, 70, or even 40 years.

Antony Smith age outlier

In a small sample, the task of finding outliers by inspecting tables can be easy. But when the number of observations goes into the thousands or millions, it becomes impossible. The task becomes even harder when many variables (the worksheet columns) are involved. For this, there are other methods, and even the simple range check can be automated, as in the sketch below.
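
When the dataset is too large to scan by eye, the same kind of check can be written as code. Below is a minimal pandas sketch (the table is a hypothetical reconstruction of the example above) that flags ages outside a plausible range as input errors:

```python
import pandas as pd

# Hypothetical reconstruction of the example table
people = pd.DataFrame({
    "name": ["Maria Souza", "Antony Smith", "João Lima"],
    "age":  [29, 470, 54],
})

# Flag ages outside a plausible human range as input errors
implausible = people[(people["age"] < 0) | (people["age"] > 120)]
print(implausible)
#            name  age
# 1  Antony Smith  470
```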

Find outliers using graphs

One of the best ways to identify outlier data is by using charts. When plotting a chart, the analyst can clearly see that something different exists. Here are some examples that illustrate how outliers show up in graphics.

Case: outliers in the Brazilian health system

In a study already published on Aquarela’s website, we analyzed the factors that lead people to miss medical appointments scheduled in the public health system of the city of Vitória, in the state of Espírito Santo, no-shows that caused an approximate loss of 8 million US dollars a year.

In the dataset, several patterns were found, for example: children practically never miss their appointments, and women attend consultations much more than men. However, a curious case was that of an outlier who, at age 79, scheduled a consultation 365 days in advance and actually showed up for her appointment.

This is an example of an outlier that deserves to be studied, because this lady’s behavior can reveal relevant information about measures that could be adopted to increase the attendance rate. See the case in the chart below.

sample of 8000 appointments

Case: outliers in the Brazilian financial market

On May 17, 2017, Petrobras shares fell 15.8% and the stock market index (IBOVESPA) fell 8.8% in a single day. Most shares on the Brazilian stock exchange saw their prices plummet that day. The main trigger of this strong negative swing was the Joesley Batista case, one of the most shocking political events of the first half of 2017.

This case represents an outlier for the analyst who, for example, wants to know the average daily return on Petrobras shares over the last 180 days. Certainly, the Joesley case pulled the average strongly downward. Analyzing the chart below, even among many observations, it is easy to identify the point that disagrees with the others.

Petrobras 2017

The data point in the example above may be called an outlier, but if the term is taken literally, it cannot necessarily be considered one, in the sense of lying “outside the curve”. The “curve” in the graph above, although counter-intuitive, is represented by the straight line that cuts through the points. From the graph you can see that, although different from the others, the point is not exactly outside the curve.

A predictive model could easily infer, with high precision, that a 9% drop in the stock market index would correspond to a 15% drop in Petrobras’ share price. In another case, still with data from the Brazilian stock market, the stock of the company Magazine Luiza appreciated 30.8% on a day when the stock market index rose by only 0.7%. This data point, besides being atypical and distant from the others, also represents a true outlier. See the chart:

This is the kind of outlier that can harm not only descriptive statistics such as the mean and the median, but also the calibration of predictive models.
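
A small numerical sketch (with made-up daily returns, not actual market data) makes the point: a single extreme day drags the mean far more than the median, which is also what throws off model calibration:

```python
import numpy as np

# Hypothetical daily returns (%): small fluctuations plus one crash-like day
returns = np.array([0.3, -0.2, 0.5, 0.1, -0.4, 0.2, -15.8])

print("mean:  ", returns.mean())      # about -2.19, pulled strongly toward the crash
print("median:", np.median(returns))  # 0.1, barely affected by the single outlier

# Removing the outlier shows how much it biased the average
print("mean without outlier:", returns[returns > -10].mean())  # about 0.08
```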

Find outliers using statistical methods

A more complex but quite precise way of finding outliers in a data analysis is to find the statistical distribution that most closely approximates the distribution of the data and then use statistical methods to detect discrepant points. The following example shows the histogram of the well-known driving metric “kilometers per liter”.

The dataset used in this example is a public dataset widely used in statistical tests by data scientists. It comes from the 1974 Motor Trend US magazine and comprises several aspects of the performance of 32 car models. More details at this link.

The histogram is one of the main and simplest graphing tools for the data analyst to use in understanding the behavior of the data.

In the histogram below, the blue line represents what the normal (Gaussian) distribution would be based on the mean, standard deviation and sample size, and is contrasted with the histogram in bars.

The red vertical lines represent units of standard deviation. It can be seen that cars with outlier performance for that era could average more than 14 kilometers per liter, which corresponds to more than 2 standard deviations from the average.

Under a normal distribution, data within two standard deviations of the mean corresponds to about 95% of all data; the outliers represent, in this analysis, the remaining 5%.
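
A minimal sketch of this approach (using illustrative kilometers-per-liter figures rather than the actual values from the dataset above), flagging observations that fall more than two standard deviations from the mean:

```python
import numpy as np

# Illustrative km/l figures for a set of cars (not the actual Motor Trend data)
kml = np.array([7.2, 8.1, 8.5, 9.0, 9.3, 9.8, 10.1, 10.4, 11.0, 11.5, 12.0, 15.3])

mean, std = kml.mean(), kml.std(ddof=1)
low, high = mean - 2 * std, mean + 2 * std

# Under an approximately normal distribution, ~5% of points fall outside this band
outliers = kml[(kml < low) | (kml > high)]
print(outliers)  # [15.3] -> flagged as more than 2 standard deviations above the mean
```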

Outliers in clustering

In this video in English (with subtitles), we present the identification of outliers visually, using a clustering process with national flags.
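
One programmatic way to reproduce this idea (not the exact method shown in the video) is density-based clustering, where points that do not fit any cluster are labeled as noise. A minimal sketch with made-up two-dimensional data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of made-up points plus one isolated observation
points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # cluster 1
    [5.0, 5.2], [5.1, 4.9], [4.9, 5.0],   # cluster 2
    [9.0, 0.5],                           # isolated point
])

labels = DBSCAN(eps=0.6, min_samples=2).fit_predict(points)
print(labels)  # points labeled -1 are treated as noise, i.e. candidate outliers
# [ 0  0  0  1  1  1 -1]
```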

Conclusions: What to do with outliers?

We have seen that it is imperative to pay attention to outliers because they can bias an entire data analysis. Beyond identifying outliers, we suggest some ways to treat them:

  • Exclude the discrepant observations from the data sample: when the discrepant data is the result of a data-input error, it needs to be removed from the sample;
  • Perform a separate analysis with only the outliers: this approach is useful when you want to investigate extreme cases, such as students who only get good grades, companies that make a profit even in times of crisis, fraud cases, among others;
  • Use clustering methods to find an approximation that corrects and assigns a new value to the outlier data: in cases of data-input errors, instead of deleting and losing an entire row of records because of a single outlier observation, one solution is to use clustering algorithms that find the behavior of the observations closest to the outlier and infer the best approximate value (see the sketch after this list).
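
Below is a minimal sketch of that last approach (with hypothetical records; k-nearest-neighbour imputation is used here as a stand-in for the clustering-based correction described above):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical records: age and monthly income; 470 is clearly an input error
data = np.array([
    [29.0, 3500.0],
    [31.0, 3700.0],
    [470.0, 3600.0],   # outlier caused by a typing error
    [33.0, 3800.0],
])

# Mark the erroneous value as missing and let the nearest neighbours suggest a replacement
data[2, 0] = np.nan
fixed = KNNImputer(n_neighbors=2).fit_transform(data)
print(fixed[2])  # the 470 is replaced by the average age of the closest records
```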

Finally, the main conclusion about the outliers can be summarized as follows:

“an outlier may be what most disturbs your analysis, but it may also be exactly what you are looking for.”

What is Aquarela Advanced Analytics?

Aquarela Analytics is a pioneering Brazilian company and a reference in the application of Artificial Intelligence in industry and large companies. With the Vortx platform and the DCIM methodology, it serves major global customers such as Embraer (aerospace & defence), Scania and Randon Group (automotive), Solar Br Coca-Cola (beverages), Hospital das Clínicas (healthcare), NTS-Brasil (oil & gas), and Votorantim Energia (energy), among others.

Stay tuned by following Aquarela on LinkedIn!
