In this post we will look at how you can use some basic SQL code to solve many of the data science problems you will encounter. Granted, SVM, random forest, and GBM usually outperform logistic regression and KNN, but that does not mean these two are unused or unnecessary.
Let's start off with some basics: you will need MySQL installed and a basic understanding of SQL. If SQL is a little confusing, I recommend working through an online tutorial first.
We are going to use MySQL to accomplish the following:
- Frequency Table
- Mean
- Median
- Standard Deviation
- Log Transformation
- Z-Scores (Outlier detection)
- Correlation
- Multiple Linear Regression
- Naive Bayes
- KNN
- K-means
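To give a feel for what the simpler items on this list look like in MySQL, here is a minimal sketch covering a frequency table, mean, standard deviation, log transformation, and z-scores. The table name `adult_stats_demo` and its handful of made-up rows are assumptions for illustration only; the column names are loosely modeled on the Adult data set.

```sql
-- Toy table with made-up rows (hypothetical name, not the real import).
DROP TABLE IF EXISTS adult_stats_demo;
CREATE TABLE adult_stats_demo (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    age          INT,
    workclass    VARCHAR(32),
    capital_gain INT
);

INSERT INTO adult_stats_demo (age, workclass, capital_gain) VALUES
    (39, 'State-gov', 2174),
    (50, 'Self-emp',  0),
    (38, 'Private',   0),
    (53, 'Private',   0),
    (28, 'Private',   0),
    (37, 'Private',   14084);

-- Frequency table: number of rows per category, most common first.
SELECT workclass, COUNT(*) AS n
FROM adult_stats_demo
GROUP BY workclass
ORDER BY n DESC;

-- Mean and sample standard deviation of a numeric column.
SELECT AVG(age)         AS mean_age,
       STDDEV_SAMP(age) AS sd_age
FROM adult_stats_demo;

-- Log transformation: add 1 first so that zero values stay defined.
SELECT id, capital_gain, LN(capital_gain + 1) AS log_capital_gain
FROM adult_stats_demo;

-- Z-scores: join every row against a one-row summary of the column.
-- A filter such as ABS(z_age) > 3 could then be used to flag outliers.
SELECT a.id,
       a.age,
       (a.age - s.mean_age) / s.sd_age AS z_age
FROM adult_stats_demo AS a
CROSS JOIN (
    SELECT AVG(age) AS mean_age, STDDEV_SAMP(age) AS sd_age
    FROM adult_stats_demo
) AS s
ORDER BY z_age DESC;
```

The summary subquery is computed once and cross-joined, which keeps the z-score query to a single statement. A regular table is used rather than a temporary one because MySQL does not allow a temporary table to be referenced twice in the same query.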
We will be using two classification data sets provided by the UCI Machine Learning Repository.
https://archive.ics.uci.edu/ml/datasets/Adult
https://archive.ics.uci.edu/ml/datasets/Covertype
We will do the following for each data set.
- Build tables
- Import Data
- Report basic statistics
- Split into test and train
- Run through algorithm and validate
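The build and split steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the table names (`adult_split_demo`, `adult_train`, `adult_test`), the reduced set of columns, the made-up rows, and the 80/20 ratio are all hypothetical. In practice the rows would come from a bulk import of the downloaded file (for example with MySQL's `LOAD DATA` statement) rather than from `INSERT`.

```sql
-- Build a small stand-in table (hypothetical name and columns).
DROP TABLE IF EXISTS adult_split_demo;
CREATE TABLE adult_split_demo (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    age       INT,
    workclass VARCHAR(32),
    income    VARCHAR(8),
    is_train  TINYINT DEFAULT NULL   -- 1 = train, 0 = test
);

-- Made-up rows standing in for the imported data.
INSERT INTO adult_split_demo (age, workclass, income) VALUES
    (39, 'State-gov', '<=50K'),
    (50, 'Self-emp',  '<=50K'),
    (38, 'Private',   '<=50K'),
    (53, 'Private',   '<=50K'),
    (28, 'Private',   '<=50K'),
    (37, 'Private',   '<=50K'),
    (52, 'Self-emp',  '>50K'),
    (31, 'Private',   '>50K'),
    (42, 'Private',   '>50K'),
    (30, 'State-gov', '>50K');

-- Assign each row to train with probability 0.8 (assumed ratio).
-- RAND() is evaluated per row; storing the flag keeps the split fixed
-- across later queries. Clients running in safe-update mode may reject
-- an UPDATE without a key-based WHERE clause.
UPDATE adult_split_demo
SET is_train = (RAND() < 0.8);

-- Materialize the two subsets.
DROP TABLE IF EXISTS adult_train;
DROP TABLE IF EXISTS adult_test;
CREATE TABLE adult_train AS
    SELECT * FROM adult_split_demo WHERE is_train = 1;
CREATE TABLE adult_test AS
    SELECT * FROM adult_split_demo WHERE is_train = 0;

-- Sanity check: how many rows landed on each side.
SELECT is_train, COUNT(*) AS n
FROM adult_split_demo
GROUP BY is_train;
```

Because the assignment is random, the exact counts vary from run to run, and with only a few rows the split can land far from 80/20. On the full data sets the proportions should settle close to the target.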