
PROJECT 3

Can clever feature extraction improve predictive models of income?

 

Client: Eighty20

Led by: Prof. Maria Fasli (University of Essex) and Dr. Melvin Varughese (UCT)

Background information

Companies have more data at their disposal than ever before, at a volume and variety unparalleled in human history. We are in the so-called "Big Data Era", and while "Big Data" is the new catchphrase, analysing large datasets brings new challenges. These challenges are being met by progressive, fast-paced research into new statistical learning algorithms. The resulting methods are applied across a broad range of industries, including finance, healthcare and economics, to gain better insight into key information hidden in large datasets.

According to Bernard Marr’s 2014 book on Big Data, over 90% of all the data in the world was created in the preceding two years. This rapid adoption of data collection and data analysis is driving the industry forward and has given rise to the field of data science. The trick in embracing and productively applying all of this accumulated data is knowing which tools to use and when, and bringing your own creativity to the problem.

At Eighty20, we are technology agnostic: we look for the best tool to solve each unique problem. As an organisation, we live and breathe creative and efficient problem solving, which is why our approach to the field is different. With the problem set we provide, we hope that you will apply a creative solution while learning something new about the real-life challenges that industries face.

Project aims and objectives

When you are first introduced to statistical learning methods, feature selection and dimension reduction are presented as vital steps in setting up your model. Datasets seem immense, and skill in applying dimension reduction techniques effectively is imperative. However, a more common type of problem in industry is what could be termed a ‘limiting feature space’: only a small number of features are available to the modeller, and creativity in feature creation and hypothesis comprehension becomes important.
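One possible form of feature creation on nominal factors is to replace a high-cardinality factor, such as postal code, with a smoothed average of the target. The sketch below is illustrative only: the function name `smoothed_target_encoding`, the smoothing weight `m` and the toy records are assumptions made for the example, not part of the supplied dataset or a prescribed method.

```python
from collections import defaultdict


def smoothed_target_encoding(categories, targets, m=5.0):
    """Map each category to its mean target, shrunk towards the global mean.

    m is a hypothetical smoothing weight: categories with few observations
    are pulled more strongly towards the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {
        cat: (sums[cat] + m * global_mean) / (counts[cat] + m)
        for cat in counts
    }


# Toy data (invented for illustration): postal codes and monthly incomes.
postal_codes = ["7700", "7700", "7700", "8001", "8001", "7441"]
incomes = [12000.0, 15000.0, 9000.0, 40000.0, 44000.0, 5000.0]

encoding = smoothed_target_encoding(postal_codes, incomes, m=2.0)
```

In this toy example the postal code with a single observation is pulled furthest towards the global mean. In practice an encoding like this would normally be computed on training folds only, so that the target does not leak into the features used for validation.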

 

This project tackles a problem of this kind: income estimation. Predicting an individual’s income from given features may seem straightforward, but the available features might be too limited to provide enough permutations to model all the distinct outcomes. This leads to a problem of ‘strong mean reversion’: predictions cluster around the average, and there is not enough variability in them to identify outliers.
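The effect can be seen in a small example. If a model can only distinguish as many kinds of individual as there are feature combinations, the best it can do under squared error is to predict the mean income of each combination, and the spread of those predictions is narrower than the spread of actual incomes. The records below are invented for illustration and use only two hypothetical features.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Toy records (invented): (education, gender) -> income.
records = [
    (("degree", "F"), 30000.0),
    (("degree", "F"), 90000.0),
    (("degree", "M"), 35000.0),
    (("degree", "M"), 45000.0),
    (("matric", "F"), 8000.0),
    (("matric", "F"), 12000.0),
    (("matric", "M"), 6000.0),
    (("matric", "M"), 54000.0),
]

# Group incomes by feature combination.
groups = defaultdict(list)
for features, income in records:
    groups[features].append(income)

# Squared-error-optimal prediction given only these features: the group mean.
group_means = {features: mean(values) for features, values in groups.items()}
predictions = [group_means[features] for features, _ in records]
actuals = [income for _, income in records]

variance_actual = pvariance(actuals)
variance_predicted = pvariance(predictions)
max_prediction = max(predictions)
```

Here the highest prediction is 60 000 even though one individual earns 90 000, and the variance of the predictions is less than half that of the actual incomes. The high earner is averaged away because no available feature separates them from the rest of their group.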

 

A further difficulty in estimating income in the South African context is the country’s distinctive bimodal income distribution, which makes conventional Gaussian modelling approaches inadequate.
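A rough illustration of why a single Gaussian fits bimodal data poorly: the fitted mean falls in the trough between the two modes, where the Gaussian expects the most observations and the data has the fewest. The log-income values below are invented for the example and are not drawn from the project data.

```python
from statistics import NormalDist, mean, pstdev

# Toy log-incomes (invented): one cluster near 8, another near 11.
log_incomes = [7.6, 7.8, 7.9, 8.0, 8.1, 8.2, 8.4,
               10.6, 10.8, 10.9, 11.0, 11.1, 11.2, 11.4]

# Fit a single Gaussian by matching the sample mean and standard deviation.
mu = mean(log_incomes)
sigma = pstdev(log_incomes)
fitted = NormalDist(mu, sigma)

# Share of observations within half a standard deviation of the mean.
lo, hi = mu - 0.5 * sigma, mu + 0.5 * sigma
observed_share = sum(lo <= x <= hi for x in log_incomes) / len(log_incomes)

# Share a single Gaussian would expect in the same interval (about 0.38).
expected_share = fitted.cdf(hi) - fitted.cdf(lo)
```

In this toy sample no observation lies within half a standard deviation of the mean, whereas the fitted Gaussian places roughly 38% of its mass there. A model with two components, such as a mixture, is one option that could be explored instead.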

For our problem, the research group will be given a relatively small dataset on which to build an income estimator for a much larger cohort. The small dataset consists mainly of nominal factors. The group will need to conduct exploratory analysis and then construct an estimator that predicts incomes for new observations. This is a challenging problem to model, especially since only a few features of each person are known.

About the dataset

The research group will be provided with a training dataset of roughly 54 000 observations. The data will be supplied in a semi-clean format so as not to restrict the group’s work or bias what it might uncover. The features available for modelling income are location (postal code), education, gender, population group and age.

 

A model built on this data will then be used to predict income for a dataset of 520 000 observations. After the weightings of observations are taken into account, that dataset will be representative of 1.5 million people.
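Where observations carry weightings, summary statistics and error measures can be weighted in the same way, so that each observation counts in proportion to the number of people it represents. The brief does not say how predictions will be evaluated, so the sketch below is an assumption: the helper names `weighted_mean` and `weighted_mae` and all values are hypothetical.

```python
def weighted_mean(values, weights):
    """Mean of values where each value counts in proportion to its weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def weighted_mae(actual, predicted, weights):
    """Weighted mean absolute error between actual and predicted incomes."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return weighted_mean(errors, weights)


# Toy values (invented): three observations and the number of people
# each one is assumed to represent.
actual = [10000.0, 20000.0, 60000.0]
predicted = [12000.0, 18000.0, 40000.0]
weights = [4.0, 3.0, 1.0]

people_represented = sum(weights)            # total population represented
mean_income = weighted_mean(actual, weights)
error = weighted_mae(actual, predicted, weights)
```

With these toy numbers the large error on the single high earner carries little weight, which is one reason an aggregate error measure alone may hide poor performance at the top of the income range.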

Intended outcomes and real-world relevance

The intended goal of the project is to develop a predictive model for estimating income, maximising its predictive power in whatever way possible.

 

Income estimation is important for marketing applications where it is necessary to know a customer’s spending power. This includes any CRM (customer relationship management) process, which involves understanding both current and future customers. Estimates with too little variability make it difficult to identify the highest income earners, who account for approximately 80% of potential spending power.
