What is a technological stack?

A stack is the set of integrated systems needed to run a single application without additional software. Above all, one of the main goals of a technology stack is to improve communication about how an application is built. Beyond that, the chosen technology package may cover:

  • the programming languages used;
  • the frameworks and tools a developer needs to interact with the application;
  • known performance attributes and limitations;
  • a survey of the strengths and weaknesses of the application in general.

As a rule, stacks have a specific purpose. For instance, if you compare the web 3.0 stack (what is web 3.0?) with a data analysis stack built on the statistical language R, you will see how different they are. That is, when building a stack you should always ask: what is the underlying business purpose?

Where does this term come from?

The term comes from the software development community, where it is also quite common to speak of a full-stack developer.

A full-stack developer, in turn, is the professional who knows how to work across all the technology layers of a fully functional application.

Why is the technological stack so important?

Firstly, just as the accountant keeps all company transactions registered for financial management, developers and project leaders need an equivalent record of the development team's technologies.

Secondly, developers cannot manage their work effectively without at least knowing what is happening and which technology assets are available (systems, databases, programming languages, communication protocols), and so on.

The technological stack is just as important as the inventory control of a company that sells physical products. It is in the technological stack that both the business strategy and the main learning (maturity) from the system tests the company has been through are concentrated.

The technological stack is the working dictionary of developers, in the same way that data analysts look at their data dictionaries to understand the meaning of variables and columns. It is an important maturity item in the governance of organizations.

Without prior knowledge of the technological stack, management is unable to plan hiring, risk mitigation, increases in service capacity and, of course, the strategy for using data in the business area.

Technology stacks are particularly useful for hiring developers, analysts and data scientists.

“Companies that try to recruit developers often include their technology stack in their job descriptions.”

For this reason, professionals interested in advancing their careers should pay attention to developing their skills in line with market demand.

Technological stack example

Take the professional social network LinkedIn, for example: it runs on a combination of frameworks, programming languages and artificial intelligence algorithms in order to stay online. Here are some examples of technologies used in its stack:

Technological Stack – LinkedIn for 300 million hits – Author: Philipp Weber (2015)

Is there a technological stack for analytics?

Yes. Currently, the areas of analytics, machine learning and artificial intelligence are known for the massive use of information systems techniques and technologies. Likewise, analytical solutions require very specific stacks to meet the functional (what the system should do) and non-functional (how the system will do it: security, speed, etc.) business requirements of each application.

As with the foundation of a house, the order in which the stack is built matters and is directly linked to the maturity of the IT and analytics teams, so we recommend reading this article: The 3 pillars of the maturity of the analytics teams (in Portuguese).

In more than 10 years of research into different types of technologies, we went through several technological compositions before arriving at the current configuration of the Aquarela Vortx platform. The main results the stack delivers to customers are:

  • reduction of technological risk (learning is already incorporated in the stack);
  • continuous technological updates;
  • speed of deployment and systems integration (go-live);
  • mature maintenance of the systems in production;
  • quality of the interfaces and flows in the production environment, since the stack makes maintaining the technicians’ knowledge more efficient.

Conclusions and recommendations

In conclusion, we presented our vision of the technological stack concept and why it also matters for analytical projects, which, in turn, impact strategic planning. Still, it is worth bearing in mind that technological stacks, just like businesses, are always evolving.

Success in defining stacks is directly linked to the maturity of the IT and analytics teams (The 3 pillars of the maturity of the analytics teams, in Portuguese).

Regardless of the sector, the decisions involved in shaping the technological stack are a factor of success or failure in IT and analytics projects, because they directly interfere in operations and in business strategy.

Finally, we recommend reading this other article on technology risk mitigation with the support of specialized companies: How to choose the best data analytics provider? (in Portuguese).

What is Aquarela Advanced Analytics?

Aquarela Analytics is a Brazilian pioneer and a reference in the application of Artificial Intelligence in industry and large companies. With the Vortx platform and the DCIM methodology, it serves important global customers such as Embraer (aerospace & defence), Scania and Randon Group (automotive), Solar Br Coca-Cola (beverages), Hospital das Clínicas (healthcare), NTS-Brasil (oil & gas) and Votorantim Energia (energy), among others.

Stay tuned by following Aquarela on LinkedIn!


What are outliers and how to treat them in Data Analytics?

What are outliers? They are data records that differ dramatically from all others: they distinguish themselves in one or more characteristics. In other words, an outlier is a value that escapes normality and can (and probably will) cause anomalies in the results obtained through algorithms and analytical systems. Therefore, outliers always need some degree of attention.

Understanding outliers is critical in data analysis for at least two reasons:

  1. The outliers may negatively bias the entire result of an analysis;
  2. the behavior of outliers may be precisely what is being sought.

While working with outliers, many words can represent them depending on the context. Some other names are: aberration, oddity, deviation, anomaly, eccentricity, nonconformity, exception, irregularity, dissent, original and so on. Below are some common situations in which outliers arise in data analysis, with suggested approaches on how to deal with them in each case.

How to identify which record is an outlier?

Find outliers using tables

The simplest way to find outliers in your data is to look directly at the data table or worksheet – the dataset, as data scientists call it. The following table clearly exemplifies a typing error, that is, an error in data input. The age field for the individual Antony Smith certainly does not hold a real age of 470 years. Looking at the table it is possible to identify the outlier, but it is difficult to say which age would be correct; there are several possibilities, such as 47, 70 or even 40 years.

Antony Smith age outlier
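As a minimal sketch of this kind of table inspection, the snippet below (Python with pandas; the names, columns and values are hypothetical, chosen to mirror the table above) flags ages that fall outside a plausible human range:

```python
import pandas as pd

# Hypothetical records mirroring the table example above.
df = pd.DataFrame({
    "name": ["Maria Silva", "Antony Smith", "John Doe"],
    "age":  [31, 470, 58],
})

# A simple sanity rule: flag ages outside a plausible human range.
suspects = df[(df["age"] < 0) | (df["age"] > 120)]
print(suspects)  # Antony Smith's 470 shows up as a likely typing error
```

Note that a rule like this can flag the record, but, as discussed above, it cannot tell us what the correct age would be.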

In a small sample, the task of finding outliers by inspecting tables can be easy. But when the number of observations goes into the thousands or millions, it becomes impossible. The task becomes even more difficult when many variables (the worksheet columns) are involved. For that, there are other methods.

Find outliers using graphs

One of the best ways to identify outliers is by using charts: when plotting the data, the analyst can often see clearly that something different exists. Here are some examples that illustrate how outliers show up in charts.
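Before the real cases, here is a minimal plotting sketch (Python with matplotlib; the data is synthetic, generated only for illustration) of how a discrepant point jumps out of a chart:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Synthetic data: 200 well-behaved points plus one artificial outlier.
x = rng.normal(50, 5, 200)
y = 2 * x + rng.normal(0, 5, 200)

plt.scatter(x, y, s=10, label="ordinary points")
plt.scatter([80], [40], color="red", label="possible outlier")
plt.legend()
plt.show()  # the red point is immediately visible, far from the cloud
```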

Case: outliers in the Brazilian health system

In a study already published on Aquarela’s website, we analyzed the factors that lead people to miss medical appointments scheduled in the public health system of the city of Vitória, in the state of Espírito Santo, causing an approximate loss of 8 million US dollars a year.

In the dataset, several patterns were found, for example: children practically never miss their appointments, and women attend consultations much more than men. However, a curious case was that of an outlier who, at age 79, scheduled a consultation 365 days in advance and actually showed up for the appointment.

This, for example, is a case of an outlier that deserves to be studied, because this lady’s behavior can bring relevant information about measures that could be adopted to increase the attendance rate. See the case in the chart below.

Sample of 8,000 appointments

Case: outliers in the Brazilian financial market

On May 17, 2017, Petrobras shares fell 15.8% and the stock market index (IBOVESPA) fell 8.8% in a single day. Most shares on the Brazilian stock exchange saw their price plummet that day. The main trigger of this strong negative variation was the Joesley Batista case, one of the most shocking political events of the first half of 2017.

This case represents an outlier for the analyst who, for example, wants to know the average daily return of Petrobras shares over the last 180 days. Certainly, the Joesley case strongly pulled the average down. Analyzing the chart below, even in the face of several observations, it is easy to identify the point that disagrees with the others.

Petrobras 2017

The data point in the example above may be called an outlier but, taken literally, it cannot necessarily be considered “outside the curve.” The “curve” in the chart, although counter-intuitive, is represented by the straight line that cuts through the points. From the chart you can see that, although different from the others, the point is not exactly outside the curve.

A predictive model could easily infer, with high precision, that a 9% drop in the stock market index would correspond to a roughly 15% drop in Petrobras’ share price. In another case, still with data from the Brazilian stock market, the shares of the company Magazine Luiza appreciated 30.8% on a day when the stock market index rose by only 0.7%. This data point, besides being atypical and distant from the others, also represents an outlier. See the chart:

This is an outlier case that can harm not only descriptive statistics, such as the mean and the median, but also the calibration of predictive models.
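One way to make the “outside the curve” distinction concrete is to fit a line of stock returns against index returns and inspect the residuals. The sketch below (Python with NumPy) uses simulated ordinary days plus the two atypical days quoted in the text; the simulated values and the slope of 1.7 are illustrative assumptions, not actual market data:

```python
import numpy as np

rng = np.random.default_rng(0)

# 180 simulated ordinary trading days: the stock roughly follows the index.
index_ret = rng.normal(0.0, 1.0, 180)                  # daily index return (%)
stock_ret = 1.7 * index_ret + rng.normal(0, 1.0, 180)  # daily stock return (%)

# Append the two atypical days mentioned in the text.
index_ret = np.append(index_ret, [-8.8, 0.7])
stock_ret = np.append(stock_ret, [-15.8, 30.8])

# Least-squares line ("the curve"): stock_ret ~ a * index_ret + b
a, b = np.polyfit(index_ret, stock_ret, 1)
residuals = stock_ret - (a * index_ret + b)

# The -8.8% / -15.8% day sits close to the line (small residual), so it is
# extreme but not "outside the curve"; the 0.7% / 30.8% day is far off it.
print(f"Petrobras-style day residual:      {residuals[-2]:+.2f}")
print(f"Magazine Luiza-style day residual: {residuals[-1]:+.2f}")
```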

Find outliers using statistical methods

A more complex but quite precise way of finding outliers is to find the statistical distribution that most closely approximates the distribution of the data and then use statistical methods to detect discrepant points. The following example uses the histogram of the well-known driver metric “kilometers per liter”.

The dataset used in this example is a public one, heavily used by data scientists in statistical tests. It comes from the 1974 Motor Trend US magazine and comprises several aspects of the performance of 32 car models. More details at this link.

The histogram is one of the main and simplest graphing tools for the data analyst to use in understanding the behavior of the data.

In the histogram below, the blue line represents what the normal (Gaussian) distribution would be, based on the mean, standard deviation and sample size, contrasted with the histogram bars.

The red vertical lines represent units of standard deviation. It can be seen that cars with outlier performance for the era could average more than 14 kilometers per liter, which corresponds to more than 2 standard deviations from the mean.

Under a normal distribution, data within two standard deviations of the mean corresponds to about 95% of all data; the outliers represent, in this analysis, the remaining 5%.
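A minimal version of this check in Python: take the 32 fuel-efficiency values (the mtcars mpg column converted to km/l at roughly 0.425 km/l per mpg; the conversion and rounding are ours, so treat the numbers as approximate) and flag everything beyond two standard deviations from the mean:

```python
import numpy as np

# mtcars "mpg" column converted to km/l (1 mpg ~ 0.425 km/l), rounded.
kml = np.array([8.9, 8.9, 9.7, 9.1, 7.9, 7.7, 6.1, 10.4, 9.7, 8.2,
                7.6, 7.0, 7.4, 6.5, 4.4, 4.4, 6.2, 13.8, 12.9, 14.4,
                9.1, 6.6, 6.5, 5.7, 8.2, 11.6, 11.0, 12.9, 6.7, 8.4,
                6.4, 9.1])

mean, std = kml.mean(), kml.std()

# Under the normal assumption, ~95% of the data falls within 2 standard
# deviations of the mean; anything beyond is a candidate outlier.
outliers = kml[np.abs(kml - mean) > 2 * std]
print(f"mean = {mean:.1f} km/l, std = {std:.1f}, outliers = {outliers}")
```

Here two of the 32 cars (13.8 and 14.4 km/l) fall above the two-standard-deviation threshold, close to the roughly 5% that the normal rule predicts.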

Outliers in clustering

In this video in English (with subtitles), we present the identification of outliers in a visual way, using a clustering process with national flags.

Conclusions: What to do with outliers?

We have seen that it is imperative to pay attention to outliers because they can bias data analysis. Beyond identifying them, we suggest some ways to treat them:

  • Exclude the discrepant observations from the data sample: when the discrepant data is the result of an input error, it should be removed from the sample;
  • perform a separate analysis with only the outliers: this approach is useful when you want to investigate extreme cases, such as students who only get good grades, companies that make a profit even in times of crisis, fraud cases, among others;
  • in cases of data input errors, instead of deleting and losing an entire row of records because of a single outlier value, one solution is to use algorithms that find the observations closest to the outlier and infer which value would best approximate the true one (a minimal sketch follows this list).
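As a sketch of that third strategy, the snippet below uses scikit-learn’s KNNImputer, a nearest-neighbour imputation that is one concrete realisation of the idea (the records and column meanings are hypothetical): the suspicious cell is marked as missing and then re-estimated from the most similar rows, instead of discarding the whole record.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical records; columns could be [age, weight_kg, height_cm].
X = np.array([
    [31.0,  70.0, 175.0],
    [47.0,  82.0, 180.0],
    [52.0,  79.0, 178.0],
    [470.0, 80.0, 179.0],   # the suspicious age from the table example
    [45.0,  85.0, 182.0],
])

# Mark only the bad cell as missing, keeping the rest of the row...
X[3, 0] = np.nan

# ...and let the rows closest in the remaining columns suggest a value.
imputer = KNNImputer(n_neighbors=3)
X_fixed = imputer.fit_transform(X)
print(X_fixed[3])  # age replaced by the mean of its 3 nearest neighbours
```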

Finally, the main conclusion about the outliers can be summarized as follows:

“An outlier may be what most disturbs your analysis, but it may also be exactly what you are looking for.”


14 sectors for applying Big Data and their input datasets

Hello folks, 

In the vast majority of talks with clients and prospects about Big Data, we soon noticed an astonishing gap between the business itself and the expectations of Data Analytics projects. Therefore, we carried out research to answer the following questions:

  • What are the main business sectors that already use Big Data?
  • What are the most common Big Data results per sector?
  • What is the minimum dataset needed to reach those results per sector?

The summary is organized in the table below.

  1. Bank, Credit and Insurance
     Raw data examples: transaction history; registration forms; external references such as the Credit Protection Service; micro and macroeconomic indices; geographic and demographic data.
     Business opportunities: credit approval; interest rate changes; market analysis; default prediction; fraud detection; identification of new niches; credit risk analysis.

  2. Security
     Raw data examples: access history; registration forms; texts of news and web content.
     Business opportunities: detection of patterns of physical or digital behaviour that pose any type of risk.

  3. Health
     Raw data examples: medical records; geographic and demographic data; genome sequencing.
     Business opportunities: predictive diagnosis (forecasting); analysis of genetic data; detection of diseases and treatments; health maps based on historical data; adverse effects of medications/treatments.

  4. Oil, gas and electricity
     Raw data examples: distributed sensor data.
     Business opportunities: optimization of production resources; fault prediction and detection.

  5. Retail
     Raw data examples: transaction history; registration forms; purchase paths in physical and/or virtual stores; geographic and demographic data; advertising data; customer complaints.
     Business opportunities: increasing sales through product-mix optimization based on purchase behaviour patterns; billing analysis (as-is, trends) over high volumes of customers and transactions; credit profiles by region; increasing satisfaction/loyalty.

  6. Production
     Raw data examples: production management/ERP system data; market data.
     Business opportunities: optimization of production against sales; reduced storage time/volume; quality control.

  7. Representative organizations
     Raw data examples: customers’ registration forms; event data; business process management and CRM systems.
     Business opportunities: suggestion of optimal combinations of company profiles and customers; business leverage with suppliers; identification of synergy opportunities.

  8. Marketing
     Raw data examples: micro and macroeconomic indices; market research; geographic and demographic data; user-generated content; data from competitors.
     Business opportunities: market segmentation; optimized allocation of advertising resources; finding niche markets; brand/product performance; identifying trends.

  9. Education
     Raw data examples: transcripts and attendance records; geographic and demographic data.
     Business opportunities: personalization of education; predictive analytics for school dropout.

  10. Financial / Economic
      Raw data examples: lists of assets and their values; transaction history; micro and macroeconomic indices.
      Business opportunities: identifying the optimal purchase value of complex assets with many analysis variables (vehicles, real estate, stocks, etc.); determining trends in asset values; discovery of opportunities.

  11. Logistics
      Raw data examples: product data; routes and delivery points.
      Business opportunities: optimization of goods flows; inventory optimization.

  12. E-commerce
      Raw data examples: customer registration; transaction history; user-generated content.
      Business opportunities: increased sales through automatic product recommendations; increased satisfaction/loyalty.

  13. Games, social networks and platforms (freemium)
      Raw data examples: access history; user registration; geographic and demographic data.
      Business opportunities: increasing the conversion rate of free users into paying users by detecting user behaviour and preferences.

  14. Recruitment
      Raw data examples: registration of prospective employees; professional history, CVs; connections on social networks.
      Business opportunities: evaluation of a person’s profile for a specific job role; criteria for hiring, promotion and dismissal; better allocation of human resources.

Conclusions

  • The table presents a summary for easy understanding of the subject. However, for each business there are many more variables, opportunities and, of course, risks. It is highly recommended to use multivariate analysis algorithms to help prioritize the data and reduce the project’s cost and complexity.
  • There are many more sectors in which excellent results have come from Big Data and data science initiatives. However, we believe these can serve as examples for the many other types of similar businesses willing to use Big Data.
  • Common to all sectors, Big Data projects need relevant and clear input data; therefore it is important to have a good understanding of these datasets and of the business model itself. We have noticed that many businesses are not yet collecting the right data in their systems, which suggests the need for pre-Big Data projects. (We will write about this soon.)
  • One obstacle for Big Data projects is the great effort needed to collect, organize and clean the input data. This can surely cause overall frustration among stakeholders.
  • At least as far as we are concerned, plug & play Big Data solutions that automatically ingest the data and immediately deliver the analysis still don’t exist. In 100% of the cases, all team members (technical and business) need to cooperate: creating hypotheses, selecting data samples, calibrating parameters, validating results and then drawing conclusions. Thus, an advanced, scientifically based methodology must be used that takes into account business as well as technical aspects of the problem.


7 characteristics to differentiate BI, Data Mining and Big Data

Hi everybody,

One of the most frequent questions in our day-to-day work at Aquarela relates to a common confusion between the concepts of Business Intelligence (BI), Data Mining and Big Data. Since all of them deal with exploratory data analysis, wide misunderstandings are not surprising. Therefore, the purpose of this post is to quickly illustrate the most striking features of each one, helping readers define their information strategy, which depends on the organization’s strategy, maturity level and context.

The basics of each involve the following steps:

  1. Survey questions: what does the customer want to learn (find out) about his/her business? For example: how many customers do we serve each month? What is the average value of the product? Which product sells best?
  2. Study of data sources: what internal/external data are available to answer the business questions? Where are the data? How can we obtain them? How can we process them?
  3. Setting the size (scope) of the project: who will be involved in the project? What is the size of the analysis or the sample? Which tools will be used? And how much will be charged?
  4. Development: operationalization of the strategy, performing data transformations, processing and interactions with the stakeholders to validate results and assumptions, finding out whether the business questions were well addressed and the results are consistent.

Up to this point, BI, Data Mining and Big Data look virtually the same, right? So, in the table below we summarize what makes them different from each other across seven characteristics, followed by important conclusions and suggestions.

Comparative table

Comparative table Aquarela English

Conclusions and Recommendations

Although our research restricts itself to 7 characteristics, the results show significant and important differences between BI, Data Mining and Big Data, serving as an initial framework to help decision makers analyze and decide which one best fits their business needs. The most important points are:

  • We see that companies with a consolidated BI solution have more maturity to embark on extensive Data Mining and/or Big Data projects. Discoveries made by Data Mining or Big Data can be quickly tested and monitored by a BI solution, so the solutions can and must coexist.
  • Big Data only makes sense with large volumes of data, and the best option for your business depends on what questions are being asked and what data is available. All solutions are input-data dependent; consequently, if the quality of the information sources is poor, the chances are the answer will be wrong: “garbage in, garbage out”.
  • The panels of BI can help you make sense of your data in a very visual and easy way, but you cannot do intense statistical analysis with them. That requires more complex solutions, alongside data scientists, to enrich the perception of the business reality by finding new correlations, new market segments (classification and prediction) and designing infographics that show global trends based on multivariate analysis.
  • Big Data extends the analysis to unstructured data, e.g. social network posts, pictures, videos, music and so on. However, the degree of complexity increases significantly, requiring expert data scientists in close cooperation with business analysts.
  • To avoid frustration, it is important to take into consideration the differences in the value proposition of each solution and its outputs. Do not expect real-time data monitoring from a Data Mining project; in the same sense, do not expect a BI solution to discover new business insights, as this is the role of the other two solutions.
  • Big Data can be considered, in part, the combination of BI and Data Mining: from BI comes a set of structured data, and from Data Mining comes a range of algorithms and data-discovery techniques. What makes Big Data a plus is the new large-scale distributed processing, storage and memory technology able to digest gigantic volumes of heterogeneous data, most notably unstructured data.
  • The results of all three can generate intelligence for the business, just as the good use of a simple spreadsheet can also generate intelligence, but it is important to assess whether that is sufficient to meet the ambitions and dilemmas of your business.
  • The true power of Big Data has not yet been fully recognized; however, today’s most technologically advanced companies base their entire strategy on the power of the advanced analytics enabled by Big Data, in many cases offering their services free of charge in order to gather valuable data from users, e.g. Gmail, Facebook, Twitter and OLX.
  • The complexity of data, as well as its volume and variety of file types, tends to keep growing, as presented in a previous post. This implies a growing demand for Big Data solutions.

In the next post we will present interesting sectors for applying exploratory data analysis and how this can be done in each case. Thank you for joining us.


From Data to Innovation

Hello folks,

In this article we present an overview of 4 pivotal concepts for upcoming publications on the blog. We are going to explore the notions of data, information, knowledge and wisdom, which are associated with different levels of innovation potential (the capacity to transform reality). We portray this in the image below:

We understand that two forces, complexity and value, positioned on the vertical and horizontal axes respectively, sustain the innovation potential. Therefore, the closer to the top right corner, the higher the potential innovation impact.

Besides the scale of potential reality transformation, the graph also points out the human and computational elements present at each level of the scale.

The meaning of each color element is as follows:

  • Dark blue – humans, with their senses, practices and experiences.
  • Orange – the innovation-enhancing elements.
  • Green – the development phases of innovation potential toward decision-making.
  • Light blue – digital tools and computers that support human labor, defined here as cognitive prostheses, which help us perform creative tasks of increasing complexity.

To define each of the innovation-enhancing elements (data, information, knowledge and wisdom), we will use a thermometer as a metaphor.

Figure: thermometer (36.2 °C)

Looking at the figure, what can we infer from it? What are data, information and knowledge in it? And what conclusions can we draw from reading a thermometer (intuition and wisdom)?

In this case, each of the innovation-enhancing elements can be described as follows:

  • Data – signs that have not been interpreted; the smallest grain and the raw material of the knowledge scale. They arise from what we experience (life events) and from what is captured by our senses and by electronic devices. In the example, the number 36.2 is just a number or a sign. Data could also be characters such as “@”, “T”, “——-” and so on.
  • Information – a set of data organized within a scale, showing a series of grouped events (data). The thermometer’s letter “C” (Celsius) is the scale; it could also be represented in Fahrenheit degrees, with different values. We humans somehow memorize the data obtained by the senses (organizing and classifying it into scales); computing systems, in turn, retain data in lists, spreadsheets, documents, databases, among others.
  • Knowledge – a type of contextual information that can change something or someone, being somehow justified. Knowing that a temperature of 38 degrees (data) on the Celsius scale (information) indicates that a person has a fever (the average temperature of a healthy person is 36.4 degrees) means some action must be taken. We humans can reflect on an issue and make a decision; computer systems use algorithms for this. Both are based on data and information. An algorithm, for example, can find product consumption patterns in a supermarket or be used to improve the traffic of a city. (A toy sketch of this ladder appears after this list.)
  • Wisdom – placed at the top of the scale, wisdom becomes subjective and seemingly irrational (illogical), composed of complex series of arguments that navigate quickly through the three previous phases. What treatment should be applied to patients suffering from fever? People with great experience can give wise counsel. In computer systems, we use instruments that connect data from different areas to bring increasingly intelligent answers (Web 3.0, also known as the Semantic Web).
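As a toy rendering of the data → information → knowledge ladder, consider the sketch below (Python; the fever threshold and the structure are illustrative assumptions, not a clinical rule):

```python
# Data: a bare number captured by a sensor, not yet interpreted.
reading = 38.0

# Information: the same number organized within a scale.
measurement = {"value": reading, "scale": "Celsius"}

def assess(measurement):
    """Knowledge: contextual interpretation that can trigger an action."""
    if measurement["scale"] != "Celsius":
        raise ValueError("expected a Celsius reading")
    if measurement["value"] >= 37.8:        # illustrative fever threshold
        return "fever: some action must be taken"
    return "normal temperature"

print(assess(measurement))  # -> fever: some action must be taken
```

Wisdom, the top of the scale, is deliberately absent from the code: deciding which treatment to apply remains with experienced humans or with far richer systems.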

In this article we provided a bit of Aquarela’s view on the notions of data, information, knowledge and wisdom. Although there is a lot of debate on the topic, consensus is still on the way. In particular, the notions presented here are useful and will be the foundation stone to explain how, in our view, the path to Web 3.0 is being paved by Big Data and Linked Open Data. For those who want to know more about the subjects covered in this article, we offer some reading suggestions below.


References

  • GETTIER, E. L. Is justified true belief knowledge? Analysis, v. 23, n. 6, p. 121–123, 1963.
  • DRUCKER, P. F. The new realities. New Brunswick, NJ: Transaction Publishers, 2003.
  • SANTOS, Marcos. Um modelo para a gestão colegiada orientada ao significado por meio da realização de PCDAs. Master’s dissertation (Knowledge Engineering and Management), Universidade Federal de Santa Catarina (UFSC), Florianópolis, 2003. Available at: http://btd.egc.ufsc.br/wp-content/uploads/2013/03/Marcos_Henrique_dos_Santos.pdf