How are Titanic passengers segmented by VORTX Big Data?

To demonstrate how VORTX works, I selected a well-known dataset with information about the passengers who embarked on the Titanic. Despite the tragic event, this dataset is rich in detail and has been widely used in Machine Learning communities, since it allows the application of several Big Data techniques.

In this case, I am going to apply VORTX, a Big Data tool focused on automatic segmentation plus other important decision-making indicators. This technique is called clustering; more information about it can be found in the post "How can a Big Data clustering strategy help business?". In the conclusion section, I give some ideas on how this innovative approach can help businesses.

Titanic Dataset summary

According to Encyclopedia Titanica, “On 10 April 1912, the new liner sailed from Southampton, England with 2,208 passengers and crew, but four days later she collided with an iceberg and sank: 1496 people died and 712 survived”. For this analysis, the data we had access to showed the following figures:

  • 1309 people on board, of whom 500 survived (38%) and 809 (62%) died.
  • The average age was 29.88 years (estimated).
  • 466 women, of whom 127 died and 339 survived.
  • 843 men, of whom 682 died and 161 survived.
  • Tickets cost on average £53.65 for women and £76.60 for men.

For more details on the complete dataset, search the web for “Titanic Dataset”.

Factors under analysis

Unfortunately, 267 passengers (20.39%) had to be excluded from the analysis due to missing age values. Furthermore, out of the 15 factors present in the original file, I selected the numerical ones with the strongest weights calculated by VORTX (a minimal code sketch of this preprocessing follows the list below). Usually, we classify factors, variables or data attributes into the following 3 categories:

  • Protagonist – factors with a strong positive influence that generate a valuable pattern with clarity.
  • Antagonist – factors with noise or unclear patterns and a negative influence that works against the protagonists.
  • Supporting – factors that do not play a significant role in changing the path of the analysis, but can enrich the results.
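As an aside, here is a minimal sketch (in Python with pandas, outside of VORTX, whose internals are not public) of the preprocessing described above: dropping passengers with missing ages and keeping the numerical factors. The file name is hypothetical; the column names follow the public Titanic dataset.

import pandas as pd

# Load the public Titanic passenger list (file name is hypothetical).
df = pd.read_csv("titanic.csv")

# Exclude passengers with missing age values, as done in this analysis
# (267 of the 1309 rows in our copy of the dataset).
df = df.dropna(subset=["age"])

# Keep the numerical factors that received the strongest VORTX weights:
# age, fare paid, parents/children aboard, siblings/spouses aboard.
factors = df[["age", "fare", "parch", "sibsp"]]
print(factors.describe())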

According to their influence power, the protagonists chosen for this analysis were:

  • Age of the passenger = 87.85%
  • How much each passenger paid to embark = 72.69%
  • Number of parents on the ship = 71.69%
  • Number of siblings or spouses on the ship = 72.42%

During the calculation, the gender factor (whether the passenger was male or female) tended to play an antagonist role, meaning the absence of a pattern to form the groups, dropping the dataset sharpness to 7%. Therefore, it was removed.

VORTX Results and group characteristics

After processing, VORTX produced the following indicators. Most of them are not offered by other algorithms, so I give a brief explanation of each:

  • Dataset Sharpness = 33.64%. It shows how clear or confident the machine is about the discovered grouping patterns. According to our dataset quality scale, sharpness above 20% is already useful for decision making.
  • Automatic discovery of segments (groups) = 8. This is a function that makes the whole process a lot easier for the data analyst. Unlike k-means and other algorithms, VORTX finds the right (ideal) number of groups by itself, dramatically reducing the segmentation errors that typically happen (an open-source analogue of this idea is sketched after this list).
  • Clustering Distinctness = how different the elements of each group are in relation to the rest of the dataset, which is what makes them a group. The most distinctive group is number 5 with 51.48% (darker color) and the least distinctive is group 1 with 8.58%. This means that elements from group 5 tend to be more homogeneous than those of the other groups.
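VORTX’s group-discovery algorithm is proprietary, so the following is only an open analogue of the idea: scan candidate group counts and score each resulting clustering, for example with the silhouette coefficient. A minimal sketch, using scikit-learn on toy data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs standing in for passenger factor vectors.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 4, 8)])

# Score each candidate number of groups and keep the best-scoring one.
best_k, best_score = None, -1.0
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print(f"chosen number of groups = {best_k} (silhouette = {best_score:.2f})")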
Figure: VORTX view of the resulting groups (screenshot).

By analyzing the groups and checking them against the passengers who did or did not survive the trip, I arrived at the survival rate of each group plus its average ticket fare; if you had the characteristics of group 5 or 7, you would have had better chances of surviving.
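For readers who want to reproduce this kind of cross-check, a minimal pandas sketch follows. The data are illustrative stand-ins, not the real passenger list: `group` is the segment label assigned by the clustering, `survived` is the 0/1 outcome and `fare` is the ticket price.

import pandas as pd

# Illustrative stand-in data (values are made up).
df = pd.DataFrame({
    "group":    [5, 5, 7, 7, 1, 1],
    "survived": [1, 1, 1, 0, 0, 0],
    "fare":     [80.0, 60.0, 75.0, 50.0, 10.0, 8.0],
})

# Survival rate, average fare and size per segment.
summary = df.groupby("group").agg(
    survival_rate=("survived", "mean"),
    avg_fare=("fare", "mean"),
    passengers=("survived", "size"),
)
print(summary.sort_values("survival_rate", ascending=False))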

How can a Big Data clustering strategy help business?

Hello folks,

To clarify the concept of clustering, a recurring theme in the machine learning area, we made a video tutorial that demonstrates a clustering problem we can solve visually, then closes with a real case and some conclusions. It is important to mention that many other areas may benefit from this technique, for example by targeting markets where you can reach different audiences according to their characteristics.

Below is the description of the video for those who like reading.

To facilitate the absorption of the concept, we will use a visual example. So, imagine that you have a textile factory and you want to produce as many flags as possible, in the shortest time and with as few materials as possible. Considering that there are around 200 national flags and each has different colors and shapes, we are interested in knowing which color and shape patterns exist, in order to optimize and organize the production line. That’s the idea: reduce costs and time while maintaining quality and volume.

Figure 1 – Representation of raw data without patterns detected

A good clustering algorithm should be able to identify patterns in the raw data, just as we humans can visually spot the similarity between the Italian, Irish and Mexican flags in the example below. One factor that differentiates clustering algorithms from classification algorithms is that they get no hints about the patterns to be studied; the model must figure them out automatically, and this is a big challenge for practitioners.

Figure 2: Cluster zero (0), composed of the Italian, Irish and Mexican flags.
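To make the unsupervised setting concrete, here is a minimal sketch: the algorithm receives only feature vectors (here, made-up color proportions per flag, not measured values) and no labels at all, yet still has to find the groups.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical features: rough share of green, white and red/orange per flag.
flags = {
    "Italy":   [0.33, 0.33, 0.33],
    "Ireland": [0.33, 0.33, 0.33],   # orange counted with red here
    "Mexico":  [0.33, 0.33, 0.33],
    "Japan":   [0.00, 0.80, 0.20],
    "Poland":  [0.00, 0.50, 0.50],
}

X = np.array(list(flags.values()))
# No labels are passed in -- the algorithm must discover the groups itself.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for name, label in zip(flags, labels):
    print(f"{name}: cluster {label}")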

In this context, identifying groups whose elements resemble each other is just as important as finding the individuals who do not resemble any other element: the so-called outliers, the exceptions (one way to detect them automatically is sketched after the figure below).

Figure 3: Cluster six (6) composed of the flag of Nepal. An exception.
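VORTX’s own outlier handling is not detailed here, but as an illustration of the concept, DBSCAN is a well-known open algorithm that marks points fitting no cluster with the label -1, which is exactly the “Nepal” situation. A minimal sketch on toy data:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense groups of "flags" plus one isolated point (the Nepal analogue).
X = np.vstack([
    rng.normal(loc=0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5, scale=0.3, size=(20, 2)),
    [[10.0, 10.0]],   # the exception
])

labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X)
# DBSCAN labels points that belong to no cluster as -1.
print("outlier indices:", np.where(labels == -1)[0])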

Finally, as the result of a good clustering process, we have groups formed by the flags that share similar features, and isolated individuals, the outliers.

Figure 4: Clusters formed at the end of visual, human-based processing.

One of the most important factors in clustering is the number of groups into which the elements will be allocated. In many cases, we have observed very different results when applying the same data and the same parameterization to different algorithms. This is very important. See below what the result of an inaccurate clustering could look like; right after the figure, a short sketch shows how two algorithms can disagree on the very same data.

Figure 5: The result of an incorrect clustering.
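To see this disagreement for yourself, the sketch below (with synthetic, illustrative data) runs two standard algorithms on exactly the same points with the same number of clusters and measures how much their partitions differ.

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
# Elongated, overlapping blobs: a shape where the "right" grouping is ambiguous.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=(3.0, 0.5), size=(100, 2)),
    rng.normal(loc=(0.0, 2.0), scale=(3.0, 0.5), size=(100, 2)),
])

a = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
b = AgglomerativeClustering(n_clusters=2).fit_predict(X)
# 1.0 means identical partitions; lower values mean the algorithms disagree.
print("agreement (adjusted Rand):", round(adjusted_rand_score(a, b), 2))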

So, a practical question is:

Would you invest your money in this?

Probably not, and solving this problem is our challenge. A real application we carried out was to identify the main characteristics of patients who don’t show up to their medical appointments, the well-known no-show problem, which has deep implications for offices, clinics, and hospitals. The result was a striking group containing 50% of the analyzed data, which really deserves a specific policy. Doesn’t this prove the point for the chief financial officers of these organizations?

Other possible applications of the clustering strategy are presented in the post “14 sectors for applying Big Data and their input datasets.”

Some conclusions

  • Our vision is very powerful at clustering images, as in the case of the flags.
  • It is humanly impossible to analyze and logically correlate the numbers in a large database, which is why clustering algorithms were created.
  • The accuracy of the results of clustering is crucial for making investment decisions.
  • Several sectors can benefit from this management approach.

Thank you!

 

What is Aquarela Advanced Analytics?

Aquarela Analytics is a pioneering Brazilian company and a reference in the application of Artificial Intelligence in industry and large companies. With the Vortx platform and the DCIM methodology, it serves important global customers such as Embraer (aerospace), Randon Group (automotive), Solar Br Coca-Cola (food), Hospital das Clínicas (health), NTS-Brazil (oil and gas) and Votorantim (energy), among others.

Stay tuned by following Aquarela’s LinkedIn!

Linked Data in practice – Guest lecture at The Developers Conference

Hi Folks,
In this post, we have prepared a video about Linked Data, its concepts and cases. The presentation took place at The Developers Conference in Florianópolis – Santa Catarina – Brazil. The video discusses the following topics:
  • What is Linked Data?
  • What is the relation between Linked Data, the Semantic Web and the future of the internet?
  • Examples of Linked Data applications in Brazil, the United States and England
  • SPARQL query demonstration
  • Application suggestions
The target audience is application developers, business developers and policy makers. By the way, the language used in the video is Brazilian Portuguese, but don’t worry, it has English subtitles. If you have any questions, please let us know. We will be glad to hear your comments.
 


14 sectors for applying Big Data and their input datasets

Hello folks, 

In the vast majority of talks with clients and prospects about Big Data, we soon noticed an astonishing gap between the business itself and the expectations for Data Analytics projects. Therefore, we carried out research to answer the following questions:

  • What are the main business sectors that already use Big Data?
  • What are the most common Big Data results per sector?
  • What is the minimum dataset needed to reach those results per sector?

The summary is organized below, sector by sector.

1 – Bank, Credit and Insurance
  • Raw data examples: Transaction history. Registration forms. External references such as the Credit Protection Service. Micro and macroeconomic indices. Geographic and demographic data.
  • Business opportunities: Credit approval. Interest rate changes. Market analysis. Prediction of default. Fraud detection. Identification of new niches. Credit risk analysis.

2 – Security
  • Raw data examples: Access history. Registration forms. News texts and web content.
  • Business opportunities: Detection of patterns of physical or digital behaviour that pose any type of risk.

3 – Health
  • Raw data examples: Medical records. Geographic and demographic data. Genome sequencing.
  • Business opportunities: Predictive diagnosis (forecasting). Analysis of genetic data. Detection of diseases and treatments. Health maps based on historical data. Adverse effects of medications/treatments.

4 – Oil, gas and electricity
  • Raw data examples: Distributed sensor data.
  • Business opportunities: Optimization of production resources. Fault prediction and detection.

5 – Retail
  • Raw data examples: Transaction history. Registration forms. Purchase paths in physical and/or virtual stores. Geographic and demographic data. Advertising data. Customer complaints.
  • Business opportunities: Increasing sales through product mix optimization based on behaviour patterns during purchase. Billing analysis (as-is, trends) over high volumes of customers and transactions, and credit profiles by region. Increasing satisfaction/loyalty.

6 – Production
  • Raw data examples: Production management / ERP system data. Market data.
  • Business opportunities: Optimization of production against sales. Reduced storage time/volume. Quality control.

7 – Representative organizations
  • Raw data examples: Customers’ registration forms. Event data. Business process management and CRM systems.
  • Business opportunities: Suggestion of optimal combinations of company profiles, customers and business leverage with suppliers. Identification of synergy opportunities.

8 – Marketing
  • Raw data examples: Micro and macroeconomic indices. Market research. Geographic and demographic data. User-generated content. Data from competitors.
  • Business opportunities: Market segmentation. Optimizing the allocation of advertising resources. Finding niche markets. Brand/product performance. Identifying trends.

9 – Education
  • Raw data examples: Transcripts and attendance records. Geographic and demographic data.
  • Business opportunities: Personalization of education. Predictive analytics for school dropout.

10 – Financial / Economic
  • Raw data examples: Lists of assets and their values. Transaction history. Micro and macroeconomic indices.
  • Business opportunities: Identifying the optimal purchase value of complex assets with many analysis variables (vehicles, real estate, stocks, etc.). Determining trends in asset values. Discovery of opportunities.

11 – Logistics
  • Raw data examples: Product data. Routes and delivery points.
  • Business opportunities: Optimization of goods flows. Inventory optimization.

12 – E-commerce
  • Raw data examples: Customer registration. Transaction history. User-generated content.
  • Business opportunities: Increased sales through automatic product recommendations. Increased satisfaction/loyalty.

13 – Games, social networks and freemium platforms
  • Raw data examples: Access history. User registration. Geographic and demographic data.
  • Business opportunities: Increasing the conversion rate from free to paying users by detecting user behaviour and preferences.

14 – Recruitment
  • Raw data examples: Registration of prospective employees. Professional history, CV. Connections on social networks.
  • Business opportunities: Evaluation of a person’s profile for a specific job role. Criteria for hiring, promotion and dismissal. Better allocation of human resources.

Conclusions

  • The list above presents a summary for easy understanding of the subject. However, for each business there are many more variables, opportunities and, of course, risks. It is highly recommended to use multivariate analysis algorithms to help you prioritize the data and reduce the project’s cost and complexity.
  • There are many more sectors in which excellent results have been derived from Big Data and data science initiatives. However, we believe these can serve as examples for the many other types of similar businesses willing to use Big Data.
  • Common to all sectors, Big Data projects need relevant and clear input data; therefore, it is important to have a good understanding of these datasets and of the business model itself. We’ve noticed that many businesses are not yet collecting the right data in their systems, which suggests the need for pre-Big Data projects. (We will write about this soon.)
  • One obstacle for Big Data projects is the great effort required to collect, organize and clean the input data. This can surely cause overall frustration among stakeholders.
  • At least as far as we are aware, plug & play Big Data solutions that automatically fetch the data and deliver the analysis immediately still don’t exist. In 100% of the cases, all team members (technical and business) need to cooperate: creating hypotheses, selecting data samples, calibrating parameters, validating results and then drawing conclusions. Hence, an advanced, scientifically grounded methodology must be used that takes into account the business as well as the technical aspects of the problem.


7 characteristics to differentiate BI, Data Mining and Big Data

Hi everybody,

One of the most frequent questions in our day-to-day work at Aquarela relates to a common misconception around the concepts of Business Intelligence (BI), Data Mining and Big Data. Since all of them deal with exploratory data analysis, it is not strange to see wide misunderstandings. Therefore, the purpose of this post is to quickly illustrate the most striking features of each one, helping readers define their information strategy, which depends on the organization’s strategy, maturity level and context.

The basics of each involve the following steps:

  1. Survey questions: what does the customer want to learn (find out) about his/her business? For example: How many customers do we serve each month? What is the average value of the product? Which product sells best?
  2. Study of data sources: what internal/external data are available to answer the business questions? Where are the data? How can I obtain them? How can I process them?
  3. Setting the size (scope) of the project: who will be involved in the project? What is the size of the analysis or the sample? Which tools will be used? And how much will be charged?
  4. Development: operationalization of the strategy, performing data transformations and processing, and interacting with the stakeholders to validate the results and assumptions, finding out whether the business questions were well addressed and the results are consistent.

Up to this point BI, Data Mining and Big Data look virtually the same, right? So, in the table below we summarize what makes them different from each other across seven characteristics, followed by important conclusions and suggestions.

Figure: Comparative table of BI, Data Mining and Big Data (Aquarela).

Conclusions and Recommendations

Although our research restricts itself to 7 characteristics, the results show that there are significant and important differences between BI, Data Mining and Big Data, serving as an initial framework to help decision makers analyse and decide which fits their business needs best. The most important points are:

  • We see that companies with a consolidated BI solution have more maturity to embark on extensive Data Mining and/or Big Data projects. Discoveries made by Data Mining or Big Data can be quickly tested and monitored by a BI solution. So, the solutions can and must coexist.
  • Big Data only makes sense with large volumes of data, and the best option for your business depends on what questions are being asked and what data are available. All solutions depend on their input data; consequently, if the quality of the information sources is poor, the chances are that the answer will be wrong: “garbage in, garbage out”.
  • While BI panels can help you make sense of your data in a very visual and easy way, you cannot do intense statistical analysis with them. That requires more complex solutions, alongside data scientists, to enrich the perception of the business reality: finding new correlations, discovering new market segments (classification and prediction), and designing infographics that show global trends based on multivariate analysis.
  • Big Data extends the analysis to unstructured data, e.g. social network posts, pictures, videos, music, etc. However, the degree of complexity increases significantly, requiring expert data scientists in close cooperation with business analysts.
  • To avoid frustration, it is important to take into consideration the differences between the value propositions of each solution and their outputs. Do not expect real-time data monitoring from a Data Mining project. In the same sense, do not expect a BI solution to discover new business insights; that is the role of the other two solutions.
  • Big Data can be considered, in part, the combination of BI and Data Mining. While BI brings a set of structured data, Data Mining brings a range of algorithms and data discovery techniques. What makes Big Data a plus is the new large-scale distributed processing, storage and memory technology for digesting gigantic volumes of heterogeneous data, more specifically unstructured data.
  • The results of all three can generate intelligence for the business, just as the good use of a simple spreadsheet can also generate intelligence, but it is important to assess whether this is sufficient to meet the ambitions and dilemmas of your business.
  • The true power of Big Data has not yet been fully recognized; however, today’s most technologically advanced companies base their entire strategy on the advanced analytics given by Big Data, and in many cases they offer their services free of charge in order to gather valuable data from their users, e.g. Gmail, Facebook, Twitter and OLX.
  • The complexity of data, as well as its volume and file types, tends to keep growing, as presented in a previous post. This implies a growing demand for Big Data solutions.

In the next post we will present interesting sectors for applying exploratory data analysis and how this can be done for each case. Thank you for joining us.
