We transform your data to make it serve you best!
Deep Neural Networks are both reliable and effective for making predictions on tabular data. Until recently, many practitioners regarded Random Forests as the best technique for tabular data analysis 99% of the time.
Currently, the best performing techniques for tabular data regression and classification tasks are widely regarded to be Random Forests, Gradient Boosting machines and K Nearest Neighbours, while older techniques such as Support Vector Machines, which suffer from the curse of dimensionality, are finally starting to be used less.
There are many tabular data analysis tasks that a deep neural network model can be trained to perform:
Fraud detection
Sales forecasting
Product failure prediction
Pricing
Credit risk
Customer retention/churn
Recommendation systems
Ad optimisation
Anti-money laundering
Resume screening
Sales prioritisation
Call centre routing
Store layout
Store location optimisation
Staff scheduling
What we normally do at DSE while working with tabular data can be briefly described as follows:
Feature engineering
Much current thinking in machine learning is to use feature engineering to preprocess your data, removing features or making assumptions about which features the practitioner thinks are in the data. This approach is familiar to people trained in classical statistics, who are used to removing parameters.
Feature engineering is still needed when using deep neural networks for tabular data, albeit much less, and the feature engineering that is required needs much less maintenance. Ideally, in tabular data analysis with Neural Networks, features aren't removed: all of the data can be kept and augmented.
Some features may need to be carefully reviewed as to whether they may discriminate; see the ethics section further below in this article.
Categorical and continuous variables
The data will have categorical and continuous variables. Continuous variables are numbers such as age or weight; they have an infinite number of possible values between any two values. Categorical variables are those whose values come from a discrete set, for example marital status or breed of dog.
Continuous data can be fed into the Neural Network as numbers, in the same way as you would feed pixel values into a Deep Neural Network.
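As a rough illustration, and assuming a pandas/PyTorch workflow (neither library is named in this article), the continuous columns of a table can simply be converted to a float tensor:

```python
import pandas as pd
import torch

# Hypothetical table with two continuous columns
df = pd.DataFrame({"age": [34, 52, 29], "weight": [70.5, 88.2, 61.0]})

# The continuous values become a float tensor, much like pixel values would
x_cont = torch.tensor(df[["age", "weight"]].values, dtype=torch.float32)
print(x_cont.shape)  # torch.Size([3, 2])
```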
Feature preprocessing
Training a deep neural network will not do all of the required feature engineering on its own, but it will find non-linearities and interactions between the features.
Where transforms would be used on image data, preprocessors are instead used to process the tabular data once, in advance of training.
This preprocessing should include filling missing data. For continuous data, missing values can be replaced with the median for the data set. It is also important for the Neural Network to be aware that the feature was missing for that data row, so a new feature can be added to indicate that there was a missing value for that feature in that row, as this in itself could be valuable information. This prevents the missing value from skewing the predictions while keeping the model aware that the row is missing data for that feature. A small sketch of this is shown below.
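A minimal sketch, assuming pandas and a hypothetical "income" column:

```python
import numpy as np
import pandas as pd

# Hypothetical continuous column with a missing value
df = pd.DataFrame({"income": [42_000.0, np.nan, 58_000.0, 61_500.0]})

# Record which rows were missing, as a feature in its own right
df["income_was_missing"] = df["income"].isna()

# Fill the missing values with the column median
df["income"] = df["income"].fillna(df["income"].median())
```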
Continuous variables can be normalised by subtracting the feature's mean and dividing by the feature's standard deviation, giving values with a mean of 0 and a standard deviation of 1. This makes it easier for the neural network to train.
The preprocessing that is applied to the training set must be applied to the validation and test sets in the same way.
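For example, the normalisation statistics can be computed on the training set only and then reused unchanged for the other splits. A sketch assuming pandas and a single hypothetical "age" column:

```python
import pandas as pd

# Hypothetical training and validation splits with one continuous column
train_df = pd.DataFrame({"age": [23.0, 45.0, 31.0, 52.0]})
valid_df = pd.DataFrame({"age": [38.0, 27.0]})

# Normalisation statistics come from the training set only
mean, std = train_df["age"].mean(), train_df["age"].std()

# The same transform is then applied to every split
train_df["age"] = (train_df["age"] - mean) / std
valid_df["age"] = (valid_df["age"] - mean) / std
```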
Embeddings for categorical variables
For each categorical variable a trainable matrix of weights can be created, with a row for each category/class in the categorical variable. These matrices are known as embeddings. The result of multiplying this embedding matrix by a one-hot encoded vector representing the category/class for the data row is then used as an input into the Neural Network. These are trained to become a set of biases for each category/class within each categorical variable.
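In practice the one-hot multiplication is implemented as a simple lookup. A minimal sketch, assuming PyTorch and a hypothetical categorical variable with four categories:

```python
import torch
import torch.nn as nn

# Hypothetical categorical variable with 4 categories, each mapped to a
# learned 3-dimensional vector (the embedding matrix has 4 rows)
emb = nn.Embedding(num_embeddings=4, embedding_dim=3)

# Category indices for a batch of rows; the lookup is equivalent to
# multiplying the embedding matrix by one-hot vectors
idx = torch.tensor([0, 2, 1])
vectors = emb(idx)  # shape (3, 3), fed into the network alongside the continuous inputs
```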
Gartner says that most organizations evolve through five levels of maturity in their journey with data.
How can Data Science Enterprise help your organization Level Up?
Based on the introduced maturity model, DSE can help businesses not just understand their individual gaps and strengths but also assess their organizational maturity in data and evaluate their capabilities across these five dimensions. We often do this through questionnaires or interviews with key technology and business stakeholders.
The main focus lies on the five dimensions below:
Vision – The clarity and focus needed to set goals for data science initiatives in the long term. The extent to which these goals align with larger organizational business strategies.
Planning – Translation of data science goals into execution plans and a robust short- and long-term roadmap. Carefully picking the individual initiatives for impact and planning them out with milestones.
Execution – Implementation of the planned data science initiatives by assembling the right data science teams, tools, and processes. Access to pertinent, good quality data that is sourced, transformed and stored effectively. Ability to identify actionable insights by applying the right level of analytics. Enabling consumption of insights through data storytelling.
Value Realization – Adoption of data science initiatives across the organization. Planning for actionability across milestones with robust measurement of ROI.
Data Culture – Scaling of data initiatives across the organization. Promoting data literacy across all teams to enable users to make decisions using data.
Would you like to get more info?
Simply get in touch!