About Me

My name is Hongyi (James) Gao, and I am currently a first-year student in the M.S. in Data Science and Analytics program at Georgetown University.

I earned my Bachelor of Science degree in Economics at the University of Washington.


I love to travel, play sports, learn new languages, and get to know different cultures. I am a native Chinese speaker, fluent in English, and I can also speak a little French and German. Having been to more than 15 countries and regions, I am still excited to explore more of the world.



Introduction

Can you believe that about 50 million lives are taken by heart attacks each year? That figure is an exaggeration; the real number is about 17.9 million per year.
Yet when people first hear a statistic like this, they tend to believe it rather than check it. This is the charm of data, and it is why people do data analysis.

The topic of this portfolio is heart failure prediction. What is heart failure? Put simply, it means the heart fails to function properly.
Heart disease is the number one cause of death globally, accounting for about 31% of all deaths worldwide.

Where did this inspiration come from? In a previous class, the professor said that the future of the health and wellness industry is to use data to predict diseases, not to treat them after they are diagnosed.

Why does heart disease happen? What factors are related to it? Answering these questions is the purpose of this project. Several ideas came up.
Age will certainly be an important factor: as people get older, they tend to have a higher risk of developing all kinds of diseases.
Medical history should also be taken into account: people who have had a heart attack before have a much higher chance of having another one, and related conditions, such as diabetes, may also raise the risk.

This is the story of this portfolio: using data science to understand the health industry better and, if possible, save more lives. Ten questions are posed below and are expected to be answered by the end of this portfolio.

10 Questions to be Answered

1. What are the factors that contribute to heart disease?
2. Is age a significant factor?
3. Does gender make a significant difference in the chance of getting it?
4. Does education make a significant difference?
5. Is smoking a significant factor?
6. If yes, what role does the number of cigarettes per day play?
7. Is having ever felt chest pain a significant factor?
8. Does health insurance coverage have anything to do with it?
9. What kinds of medical history carry the highest risk?
10. What can people do to minimize the risk, and which activity is the most effective?

Data Gathering

Five datasets are gathered and used in this project: three come from Kaggle, one comes from a U.S. government website, and one is collected through an API that provides air quality results for different areas.

Framingham Dataset

This dataset is publicly available on the Kaggle website, and it comes from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts.

It has 16 variables, including education, age, gender, and cigarettes smoked per day, and over 4,000 rows.

male: The person's gender (1 = male, 0 = female)
education: Years of education
currentSmoker: Whether the person currently smokes (1 = yes, 0 = no)
cigsPerDay: How many cigarettes the person smokes per day
BPMeds: Whether or not the patient was on blood pressure medication (1 = yes, 0 = no)
prevalentStroke: Whether or not the patient had previously had a stroke (1 = yes, 0 = no)
prevalentHyp: Whether or not the patient was hypertensive (1 = yes, 0 = no)
diabetes: Whether or not the patient had diabetes (1 = yes, 0 = no)
totChol: Total cholesterol level
sysBP: Systolic blood pressure
diaBP: Diastolic blood pressure
BMI: Body Mass Index
heartRate: Heart rate
glucose: Glucose level
TenYearCHD: 10-year risk of coronary heart disease (CHD) (binary: 1 = yes, 0 = no). This is the target.


Framingham dataset

UCI Dataset

This dataset is available on Kaggle and was originally provided by UCI.

It has 14 variables and over 300 rows.

age: The person's age in years
sex: The person's sex (1 = male, 0 = female)
cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)
chol: The person's cholesterol measurement in mg/dl
fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
thalach: The person's maximum heart rate achieved
exang: Exercise induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot.)
slope: The slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
ca: The number of major vessels (0-3)
thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
target: Heart disease (0 = no, 1 = yes)


UCI dataset

Clinic Dataset

This dataset is available on Kaggle and was provided by a clinic.
It has 13 columns and around 300 rows.
One difference with this dataset is that it contains only patients who have already been diagnosed with CVDs, and the goal here is to predict the rate of death.

age: The person's age in years
anaemia: Decrease of red blood cells or hemoglobin (0 as no and 1 as yes)
creatinine_phosphokinase: An enzyme in the body; when the total CPK level is high, it most often means there has been injury or stress to muscle tissue, the heart, or the brain
diabetes: Metabolic disease that causes high blood sugar(0 as no and 1 as yes)
ejection_fraction: Measurement of the percentage of blood leaving your heart each time it contracts
high_blood_pressure: Indicates if the blood pressure is high or not(1 as high and 0 as not)
platelets: Count of platelets in the blood
serum_creatinine: Level of serum creatinine in the blood (mg/dL)
serum_sodium: Level of serum sodium in the blood (mEq/L)
sex: Male or female(male as 1 and female as 0)
smoking: 1 as yes and 0 as no
time: Follow-up period (days)
DEATH_EVENT: 1 as deceased and 0 as not


Clinic dataset

CVDs by County Dataset

This dataset is available on the website of the U.S. Department of Health & Human Services and represents mortality by county. It has 13 variables and around 60,000 rows.

Variables I will be using:
LocationAbbr: State abbreviation
LocationDesc: County names
Data_Value: Mortality rates

The rest of the variables have no influence on the target.


HHS dataset

API Data

An API is used to access air quality data. The goal is to examine the relationship between air quality and CVDs.
API dataset
API link
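
For illustration, here is a minimal sketch of how air quality data could be pulled with Python's requests library. The endpoint URL and query parameters below are placeholders, not the actual API linked above.

```python
import requests

# Hypothetical endpoint and parameters -- placeholders, not the actual API used in this project.
BASE_URL = "https://example.com/api/air-quality"
params = {"state": "VA", "pollutant": "PM2.5", "year": 2019}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()              # fail fast if the request did not succeed
records = response.json()                # most air-quality APIs return JSON

# Keep only the fields needed to join with the CVD mortality data later.
rows = [(r.get("county"), r.get("value")) for r in records]
print(rows[:5])
```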

Data Cleaning


All codes and datasets

Framingham Dataset

Data types and Formats


1. Load the dataset first


2. Change the data types and formats if necessary


Missing Values


1. Find out which column has NAs



2. Delete NA rows under columns that have very few NAs



3. Fix the remaining rows that have NAs
For variables with outliers, the median is imputed in the NAs.
For variables without outliers, the mean is imputed in the NAs.
1) ggplot() is used to make boxplots to see which variables have outliers.


2) Impute the mean in NA rows of education

3) Impute 0 in NA rows of BPMeds

4) Impute the median in NA rows of the remaining variables (a pandas sketch of this logic follows below)
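
The cleaning itself was done in R; the pandas sketch below only illustrates the same imputation logic. The file name and the specific column choices are assumptions for the sketch, mirroring the steps above.

```python
import pandas as pd

# File name assumed; the column choices below mirror the steps above and are illustrative.
df = pd.read_csv("framingham.csv")

# 2. Drop rows for columns that have only a handful of NAs.
df = df.dropna(subset=["heartRate"])

# 3.2) education showed no outliers in the boxplots, so impute the mean.
df["education"] = df["education"].fillna(df["education"].mean())

# 3.3) BPMeds is binary, so missing values are filled with 0 (not on medication).
df["BPMeds"] = df["BPMeds"].fillna(0)

# 3.4) The remaining columns have outliers, so impute the median.
for col in ["cigsPerDay", "totChol", "BMI", "glucose"]:
    df[col] = df[col].fillna(df[col].median())

print(df.isna().sum())   # confirm that no NAs remain
```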

Outliers and Incorrect Values


1. Use summary() to try to eyeball outliers or incorrect values and fix them



2. Use boxplot to detect outliers or incorrect values of other variables





There are outliers in these variables. However, they are not necessarily wrong. Some research was done, and these values can genuinely occur in people with very poor health.


Duplicates

Use distinct() to remove all duplicates

UCI Dataset

Data types and Formats


1. Load the dataset in python



2. Change the data types and formats if necessary

Missing Values


There are no NAs in any column

Outliers and Incorrect Values


1. Use DataFrame.describe() to try to eyeball outliers


Unfortunately, no outliers or incorrect values can be detected at this time.

2. Use Seaborn to do boxplot for these variables in python




Except for the variable thalach, the outliers in the other variables do not need to be removed or replaced.
The outliers in thalach are removed, as sketched below.
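
A minimal sketch of one common removal rule (values outside the 1.5 × IQR boxplot whiskers); the file name is assumed, and the exact rule used in the project may differ.

```python
import pandas as pd

df = pd.read_csv("heart.csv")  # the UCI dataset; file name assumed

# Flag thalach values outside the usual 1.5 * IQR whiskers of a boxplot.
q1, q3 = df["thalach"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

before = len(df)
df = df[df["thalach"].between(lower, upper)]
print(f"Removed {before - len(df)} outlier rows from thalach")
```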



Duplicates


Use DataFrame.drop_duplicates() to remove duplicates



There are no duplicates in this dataset

Clinic Dataset

Data types and Formats


1. Load the dataset in python



2. Change the data types and formats if necessary

Missing Values


There are no NAs in any column

Outliers and Incorrect Values


1. Use DataFrame.describe() to try to eyeball outliers


Unfortunately, no outliers or incorrect values can be detected at this time.

2. Use Seaborn to do boxplot for these variables in python




The outliers of two variables, ejection_fraction and serum_sodium, are removed. The reason is that they have very few outliers, so removing them does not affect the dataset much.
For the other variables, the outlying values are plausible, which is why they are not removed.


Duplicates

Use DataFrame.drop_duplicates() to remove duplicates



There are no duplicates in this dataset

CVDs by County Dataset

Data types and Formats


1. Load the dataset first



2. Pick the columns that are useful



The rest of the columns are not useful; knowing the year or the longitude and latitude does not help the model.

3. Change data types and formats if necessary



Missing Values

1. Find out which column has NAs



2. There are too many NAs to delete, as removing them would immediately destroy the dataset and the model.

ggplot() is used to make a boxplot. If there are outliers, the median is imputed in those NAs; otherwise, the mean is imputed.



Outliers and Incorrect Values


Use boxplot() to find if there are any outliers and incorrect values



There are many outliers, and simply deleting them would destroy this dataset. Besides, they are not necessarily wrong. This dataset represents mortality rates at the county level, which is a very small geographic unit,
so extreme values can occur.


Duplicates

Use distinct() to remove all duplicates



There are roughly 24,000 duplicate rows, and they are all removed.

Clustering

All codes and visualizations

Data Overview

In this section, two files are created for the clustering analysis in Python and R. One is a record dataset and the other is text-based data.

Record dataset

This dataset is created by deleting certain columns of the UCI dataset and binning ages into age groups (0-35: young, 36-50: middle, 51 or older: old).



Text-based Dataframe

There are 7 text files created using the HHS dataset. They are statements about the heart attack death rate for several states (AK stands for Alaska, AR stands for Arkansas, and HHS is a general statement about the dataset).

Clustering in Python

1. Load and prepare the text data

2. Vectorize the data and convert it to a dataframe

3. Cluster with the number of clusters set to 2 and 3

4. Visualizations of the clusters (a condensed sketch of steps 1-3 follows this list)

5. Load the record dataset and remove the label

6. Cluster with n=2 and visualize

7. Cluster with n=3 and visualize

8. Euclidean Distance

9. Manhattan Distance

10. Cosine Distance

11. Heat Map
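
A condensed, hedged sketch of the text-clustering steps (load, vectorize, cluster with k = 2 and 3). The corpus folder name is an assumption, and the parameters are illustrative rather than the exact ones used in the project.

```python
from pathlib import Path

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Load and prepare the text data (folder name assumed).
texts, names = [], []
for path in sorted(Path("corpus").glob("*.txt")):
    texts.append(path.read_text(encoding="utf-8"))
    names.append(path.stem)

# 2. Vectorize the data and convert it to a dataframe.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)
dtm = pd.DataFrame(X.toarray(), index=names, columns=vectorizer.get_feature_names_out())
print(dtm.shape)

# 3. Cluster with 2 and then 3 clusters.
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}:", dict(zip(names, labels)))
```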

Clustering in R

1. Load and prepare the record dataset

2. Find the optimal k

3. Use 3 different methods to check

The top right graph is from the Silhouette method, the bottom left graph is from the Elbow method, and the bottom right graph is from the Gap statistic.
As we can see, k=2 and k=4 are the optimal choices.

4. Use 3 different distance metrics with k=4 and k=2





5. Load and prepare the corpus data

6. Convert it to a matrix

7. Normalize

8. Find the optimal k

9. Visualize k-means

10. Wordcloud

Clustering comparison

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

Density-Based Clustering refers to unsupervised learning methods that identify distinctive groups/clusters in the data, based on the idea that a cluster in a data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density.

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space.

Manhattan distance is the distance between two points measured along axes at right angles.



The Minkowski distance or Minkowski metric is a metric in a normed vector space which can be considered as a generalization of both the Euclidean distance and the Manhattan distance.
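
A small sketch of these distances computed with scipy (the vectors are illustrative); cosine distance is included because it is also used in the Python clustering steps above.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))        # straight-line distance
print(distance.cityblock(a, b))        # Manhattan distance
print(distance.minkowski(a, b, p=3))   # Minkowski with p=3 (p=2 -> Euclidean, p=1 -> Manhattan)
print(distance.cosine(a, b))           # cosine distance = 1 - cosine similarity
```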

ARM and Networking

All codes and visualizations

1. Dataset

For this section, the Framingham dataset is used and converted into a transaction dataset.
First, the cleaned dataset is loaded, and summary() is used to see the distribution of each variable. A few of the 16 variables are chosen and categorized into groups:
male: female and male
education: elementary, middle and high
currentSmoker: nonSmoker and Smoker
cigsPerDay: none, few, normal and plenty
BPMeds: noMeds and Meds
prevalentStroke: noStroke and Stroke
diabetes: noDiabetes and Diabetes
TenYearCHD: noTarget and Target



2. ARM

First, possible rules are generated within the parameters.
Then the top 15 rules are sorted by support, confidence and lift.
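
For reference, the three measures for a rule A => B are defined (in their standard form) as:

```latex
\mathrm{support}(A \Rightarrow B) = P(A \cap B), \qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{P(A \cap B)}{P(A)}, \qquad
\mathrm{lift}(A \Rightarrow B) = \frac{P(A \cap B)}{P(A)\,P(B)}
```

The rule mining itself is done in R with arules (see the R code linked below). As a rough cross-check, a hedged Python sketch using mlxtend could look like the following; the file name and thresholds are assumptions, and the transactions are assumed to be one-hot encoded with the categories listed above.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One row per person, one boolean column per item (e.g. "female", "nonSmoker", "noTarget").
# The file name and encoding are assumptions for this sketch.
trans = pd.read_csv("framingham_transactions.csv").astype(bool)

frequent = apriori(trans, min_support=0.1, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)

# Top 15 rules sorted by each measure, mirroring the R analysis.
for measure in ("support", "confidence", "lift"):
    top15 = rules.sort_values(measure, ascending=False).head(15)
    print(top15[["antecedents", "consequents", "support", "confidence", "lift"]])
```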

R Codes

Support



The maximum support among these rules is 0.968, which is quite high.
From the graph, we can tell that {noDiabetes}=>{noStroke} and {noStroke}=>{noDiabetes} have the greatest support. This means these two items appear together most often in the dataset.
For example, out of 100 transactions, these two would appear together in almost 97.
In other words, {noDiabetes} and {noStroke} almost always appear together, which probably indicates that if you do not have diabetes, you are not likely to have a stroke.


Confidence



The maximum confidence among these rules is 0.9977.
From the graph, we can tell that {nonSmoker}=>{none} has the greatest confidence.
Confidence measures a conditional probability: in this case, given that a person is not a smoker,
the probability that this person smokes zero cigarettes per day is essentially 1.
{none} almost always appears together with {nonSmoker}.
This makes complete sense: if you are not a smoker, then of course you won't smoke.


Lift



The maximum lift among these rules is 1.955.
Lift in arules measures dependence: a lift of 1 indicates that the two itemsets are independent, values below 1 indicate a negative association, and the greater the lift above 1, the more strongly the itemsets tend to occur together.
From the graph, we can tell that {female, none}=>{nonSmoker} has the greatest lift. This means this association rule is the strongest among the rules found, and the itemsets are highly dependent.
In other words, if you are a female who smokes zero cigarettes per day, there is a great chance you are not a smoker.


3. Visualizations

Three visualizations are created: a scatter plot of lift, an interactive network graph, and a Network D3 graph.

Scatter plot of lift




This scatter plot shows the top 10 rules with the highest lift. The x-axis is support and the y-axis is confidence.
As the value of lift increases, support and confidence increase as well.
All of these rules are strong because their lift values are greater than 1, which means the itemsets are positively associated rather than independent.


Interactive Network Graph


Here is an interactive network graph showing the relationships among the top rules.
The area and the color of each red circle show the strength of the association.
This corresponds to the top 15 rules sorted by lift; {female, none}=>{nonSmoker} has the greatest lift.

Network D3

Network D3 Visualization
Here is a Network D3 graph showing the relationships among the rules within the chosen parameters.
The largest circles are noStroke, noMeds and noDiabetes, and many edges are connected to them.
If we interact with the graph, we find that {noMeds, noStroke, noDiabetes}=>{noTarget} is the strongest rule.

4. Conclusion

Association rule mining is a great way to discover interesting relations between variables in large databases. Three measures are needed: support, confidence and lift.
The value of lift has to be greater than 1 for a rule to indicate a positive association. For example, {noMeds, noStroke, noDiabetes}=>{noTarget} is a strong association. This means that if a person is not taking blood pressure medication, has not previously had a stroke, and does not have diabetes, he or she will probably not develop heart disease.

Confidence measures the probability that one itemset appears given that another itemset appears. The maximum value of confidence is 1, which means the two itemsets always appear at the same time. From the graph above, {nonSmoker}=>{none} has a confidence of 1. This makes a lot of sense, as a nonsmoker smokes zero cigarettes, so {none} will always appear whenever {nonSmoker} appears.

Support measures the probability of an itemset occurring. Within my parameters, {nonSmoker}=>{none} has the highest support, which means these two items appear together the most among all transactions.
If you are not a smoker, you won't smoke. This is why these two items have the greatest support and almost always appear together.

Decision Trees

All codes and visualizations

Introduction

In this section, decision trees are used to analyze two kinds of datasets using Python and R. Python is used to analyze the text data (corpus) and R is used to analyze the record datasets with labels.

Datasets

Text data

For the text data, 2 corpus files are created about this topic.
The first corpus file (corpus) contains 7 text files, which are statements about the CVDs by county dataset.
The second corpus file (corpus1) contains 4 text files, which are general descriptions of the datasets.

Record datasets

Two datasets are chosen for this section: the Framingham dataset and the UCI dataset.


Analysis in Python for text data

Python code

For this section, four vectorizers are created, and it's necessary to check which vectorizer works best for building decision trees on the text data.
Two of the vectorizers use a stemmer and two do not.
Vectorizers with a stemmer reduce each word to its stem.
For example, if fish, fisher and fishing occur in the text data, they are all reduced to the stem fish.
Vectorizers without a stemmer keep each word as it is.
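
A minimal sketch of how a stemmed and an unstemmed vectorizer can be built with scikit-learn and NLTK; the exact vectorizer settings used in the project may differ, and the example documents are made up.

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
base = CountVectorizer(stop_words="english")
analyzer = base.build_analyzer()

def stemmed_analyzer(doc):
    # Reuse the default tokenization and stop-word removal, then reduce each token to its stem.
    return [stemmer.stem(token) for token in analyzer(doc)]

vec_plain = CountVectorizer(stop_words="english")        # without stemmer
vec_stem = CountVectorizer(analyzer=stemmed_analyzer)    # with stemmer

docs = ["fishing and fishers fish", "the dataset describes mortality"]
print(vec_plain.fit(docs).get_feature_names_out())
print(vec_stem.fit(docs).get_feature_names_out())
```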

Confusion Matrices

The text data is split into training and test sets, so decision trees can be used to predict the test data and the results can be compared.
Let's take a minute to review what a confusion matrix is first.



In this graph, the true classes (the actual values) form the columns of the matrix, and the predicted values form the rows.
TP stands for True Positive: for example, we predict this person is a target for heart disease, and he indeed is a patient.
TN stands for True Negative: we predict this person is not a target, and we are right. TP and TN are the results we predict correctly.
FP stands for False Positive: we predict this person is a target, but he is not.
FN stands for False Negative: we predict this person is a non-target, but he actually is a target. FN and FP are the results we predict wrong.
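
A quick sketch of how these counts and the accuracy come out of scikit-learn (the labels are illustrative). Note that scikit-learn's own convention puts the true classes on the rows and the predicted classes on the columns.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative true and predicted labels (1 = target for heart disease, 0 = not).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# With labels=[0, 1], ravel() returns the four cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print("accuracy =", accuracy_score(y_true, y_pred))   # (TP + TN) / total
```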

Now let's take a look at the 4 confusion matrices and the important features.


Let's take a look at the first matrix. The value for TP is 0 and for TN is 1. The model only predicted 1 case correctly in total; the reason might be that the chosen text data is very small, so a good model cannot be trained with such a tiny training set.
For this model, the most important feature is uci, as one can tell from the values.
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
To put this in a non-technical way, the greater the value of feature importance, the better this word can separate the documents.
If we think about it for a few seconds, this actually makes sense: if we choose uci as the node, probably only one file contains uci, which is the text file describing the UCI dataset.
These features are sorted in descending order.

Decision Tree Visualizations

The Python package "graphviz" is used to draw the decision trees and random forest trees for my model.
The first one is for the model with the stemmer. There is one decision tree and a random forest with 5 estimators.

From this graph, we can see that Arkansas is the top node. When it is used, the files that do not contain Arkansas are categorized into corpus1. That node is then pure, as we can tell from the gini value: when gini is 0, all the values in the node belong to one class and it cannot be split again.
After this, mortal is used, which is the stem of mortality. All files that do not contain mortal are categorized into corpus.
Then dataset is used, and so on.

Now let's look at the trees for the model without stemmer.


In this one, the first node is the same as in the graph for the model with the stemmer.
Then management is used, and so on.

Feature Importance Visualization

After looking at the decision trees, let's visualize the feature importance of the random forest.
The top ten features are printed in descending order, rounded to two digits.
Then they are visualized.



Analysis in R for record data

R code

For this section, R is used to apply decision trees and random forests to two datasets:
the Framingham dataset and the UCI dataset.
First, some of the variables are recoded into categories, for example classifying gender as male and female instead of 0 and 1.
Then each dataset is split into a training set and a test set.
A confusion matrix is also created.

Framingham Dataset

Decision Tree Visualization

This dataset has over 4,000 rows, and the data is split into a training set containing around 3,200 rows and a test set consisting of around 800 rows.
First, a decision tree is created using the rpart() and fancyRpartPlot() functions.



At first glance, one might think something is wrong, because at the first split both sides are noTarget, which seems not to make sense.
However, digging deeper into the tree shows this is not necessarily wrong.
Let's explore this tree together :)
The first node is sysBP; recall that it means systolic blood pressure.
At the value of 136, the data is split into two parts, both labeled noTarget.
Let's go one step further and focus on the side where sysBP is smaller than 136.
The data is split once again at age equal to 47.
This time, let's focus on the side where age is greater than or equal to 47.
Here, the data is split once again at a glucose level of 145.
Finally, we have some useful findings.
If you are older than 47 and have a glucose level greater than 145, you are predicted to be a target for heart disease.
Now, let's summarize the analysis for this subtree.
If your systolic blood pressure is less than 136 and you are younger than 47, you are not a target. This makes sense, because young people with stable blood pressure are a low-risk group for heart attack.
If your systolic blood pressure is less than 136, you are older than 47, but your glucose level is less than 145, you are still not a target. This indicates that glucose is a decisive factor in this group of people.
If your systolic blood pressure is less than 136 and you are older than 47 with a glucose level greater than 145, you are a target. This tells us that in our daily life, we have to be careful about our glucose level.
The analysis for all other subtrees works exactly the same way. The decision tree for this dataset is very complicated, because heart disease can be caused by many factors, and sometimes these factors contribute simultaneously. This is why the tree above looks the way it does.

Confusion Matrix and Random Forest

Confusion matrix is created using naive bayes.



From this matrix, we can tell that the accuracy of this model is 0.8156, which is in an acceptable range.
If we add all these numbers up, we get the number of rows in the test data, which is 846.
The number of true positive predictions is 23 and the number of true negative predictions is 667.
In total, the model predicts 690 cases correctly. Dividing 690 by 846 gives the same accuracy, 0.8156.

A random forest is also used to print out the most important features of this dataset.



If we look at this graph, we will find out that the most important feature is sysBP, because it has the largest value of MeanDecreaseAccuracy.
This is why sysBP is used as the first node.

UCI Dataset

This dataset has over 300 rows and is split into a training set containing around 250 rows and a test set consisting of around 50 rows.
First, a decision tree is created using the rpart() and fancyRpartPlot() functions.



Let's explore this tree. Recall that cp is chest pain; it takes values from 0 to 4, with 0 meaning no chest pain and 1-4 being different kinds of pain.
If you have no chest pain, you are not a target. However, if you have ca <= 0.5 (recall that ca is the number of major vessels) and a maximum achieved heart rate greater than 147, you are a target.
If your maximum achieved heart rate is less than 147 and you have experienced exercise-induced angina (a kind of chest pain), you are not a target.
This is somewhat the opposite of common knowledge; we tend to think that people who have experienced angina are more likely to be targets.
One possible explanation is that people who have experienced angina become very careful about their health and therefore reduce their risk.
The dataset being small could also be a reason.

Confusion Matrix and Random Forest

Confusion matrix is created using naive bayes.



From this matrix, we can tell the accuracy of my model is 0.84. In this model, 42 cases were predicted right out of 50, so the accuracy is 0.84 just like R calculated.

A random forest is also used to print out the most important features of this dataset.



If we look at this graph, we will find out that the most important features are ca and cp, because they have the largest values of MeanDecreaseAccuracy.
This is why ca and cp are used as the first two nodes.

Conclusion

First, let's review our findings in this part. For heart disease, we found out some important features using decision trees and random forests.
According to the random forests, the most important features are sysBP, age, glucose and so on.
We can see this from the random forest and also from our decision trees, as these features are the top nodes.

What does this mean for our daily life?
From the important features, we know that blood pressure, age and glucose are very important for predicting heart disease.
In daily life, we have to be really careful about our blood pressure, especially as we get older, for example into our 50s.
One good way is to check our health periodically so we can keep track of our own body.
Besides, we need to keep our glucose level under control, for example by doing daily exercise and controlling carb intake.

Naive Bayes

All codes and visualizations

Naive Bayes Introduction

Before we step into the analysis of Naive Bayes in R, I'd like to give a brief introduction to the Naive Bayes formula. It's always good to know what we are really doing instead of just importing packages and running functions we don't really understand.
The Naive Bayes formula calculates the conditional probability of an event given the evidence.
The formula looks like this:
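
With C a class and x_1, ..., x_n the observed feature values, Bayes' rule together with the "naive" assumption that the features are conditionally independent gives:

```latex
P(C \mid x_1, \dots, x_n) \;=\; \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)} \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```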


Naive Bayes in R

R code

Text Data


Data Description

Text Data

This time, a different corpus file is used than before.
The goal is to use different data to see the effects of using Naive Bayes.
This data contains 13 text files, which are novels by different authors.

Analysis

Convert Corpus File to Dataframe

In order to do the Naive Bayes analysis, the raw text has to be converted into a document-term matrix.
Then this matrix needs to be converted into a dataframe with document names as rows and each word as a column (variable).


Running Naive Bayes

First, the data is split into a training set containing 8 files and a testing set containing 5 files. Then the labels of these two sets are stored and naiveBayes() is run.



Since the output of Naive Bayes is very long, this is a screenshot of part of it.
From this output, we can see the distributions of three words (variables): abbey, abbeyoh and abbots.
If we look at the table, the rows are the names of the documents, and the columns are the means and standard deviations of each variable.
The idea behind this is that we assume the data for each class is normally distributed, so knowing the mean and standard deviation, we can compute the probability of any given value.
The variable abbey does not appear in the Austen and CBronte novels. The mean number of appearances in the Dickens novels is 1, with a standard deviation of 1.414.
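
Concretely, for a numeric feature x with class-conditional mean mu and standard deviation sigma (the two numbers shown in each table), the Gaussian likelihood being used is:

```latex
P(x \mid C) \;=\; \frac{1}{\sqrt{2\pi\sigma^{2}}}\, \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)
```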

Then a matrix of predicted labels and test labels is created to see the accuracy of this prediction.



There are 5 predictions in total and 2 of the 5 are correct, so the accuracy is 40%.

Finally, here comes the visualization part. Two plots are created: one shows the predicted labels and the other shows the actual labels.



Record Data


Data Description

Record Data

One of the previously described record datasets is used.
The Framingham dataset has over 4,000 rows and 16 variables with one decision class.

Analysis

Running Naive Bayes

First, the record dataset is split into a training set containing around 3,000 rows and a testing set containing around 1,000 rows.
Then the labels are stored and naiveBayes() is run.



From this output, we can first tell that about 85% of the people in the training set have no heart disease.
Then, from the tables, we can find some useful distributions of the variables.
For the variable male, about 41 percent of the people in the training set who have no heart disease are male, while 53% of the people in the training set who have heart disease are male.
For the variable age, the mean age of people who do not have heart disease is around 49, while the mean age of people who have heart disease is around 54.
From this distribution, we can tell that, to some degree, age might be a significant factor for heart failure.

A matrix of predicted labels and test labels is also created.



From this matrix, we can tell that the true negative and true positive cases total 886, giving an accuracy of 83.82%.
This is actually a good prediction.

Finally, two visualizations are created: one shows the predicted labels and the other shows the actual labels.



These two plots look nearly identical, suggesting this is a good prediction.


Naive Bayes in Python

Python code

Text Data


Data Description

Text Data

This time, one more corpus file is added. The labels now become the names of the 2 corpus files, not the file names inside them.
The new corpus file contains 11 files, which are descriptions of the datasets used in this portfolio.

Analysis

Vectorizing Text Data and Converting it to Dataframe




This dataframe has one column per word and one row per file.
The values indicate how many times each word appears in each file.
As we can tell, some words appear over 2,000 times and some appear only once.

Running Naive Bayes

First, the data is split into a training set and a testing set. Then the labels are stored and removed from each set.
After running Naive Bayes in Python, let's take a look at the confusion matrix first.



This matrix gives an accuracy of 75%. Given how small the training and testing sets are, this model performs quite well at training and predicting.
Now let's visualize our predictions and the actual results.



At first glance, we might think these two plots are the same.
However, they are not. The first one takes label "Novels" as its first column and the second one takes "corpus" as its first column.
These two plots correspond to the confusion matrix.

Record Data


Data Description

Record Data

The same record dataset is used as in R.

Analysis

The first step is still splitting the record dataset into a training set and a testing set.
The testing set contains around 1,000 rows and the training set contains around 3,000 rows.

Instead of printing a plain confusion matrix, a heat map of the confusion matrix is created so it can be visualized more clearly.
One reason for doing this is that a heat map shows both the labels and the values in a more readable way.



In this graph, we can tell that there are 576 true negative cases, 94 true positive cases, 494 false positive cases and 105 false negative cases.
This gives us an accuracy of 53%.
Obviously, this is not a very good model compared to what we did in R. Using Naive Bayes in R for this record dataset, we have an accuracy of 83.82%.
Naive Bayes in R wins this one!

Now let's look at the predictions and the actual values separately.



The model predicted that the numbers of people with and without heart disease are roughly the same.
However, comparing this with reality shows that the model does not actually perform well.

Conclusion

For text data, Naive Bayes actually works better in Python.
The accuracy of the prediction is 75% in Python, while it is only 40% in R.

For record data, Naive Bayes works better in R: the accuracy is 83.82% in R and only around 50% in Python, so R is the winner here!
The Naive Bayes output in R also provides a lot of useful information about the record data.
We can first tell that about 85% of the people in the training set have no heart disease.
For the variable male, about 41% of the people in the training set who have no heart disease are male, while 53% of those who have heart disease are male.
For the variable age, the mean age of people who do not have heart disease is around 49, while the mean age of those who have heart disease is around 54.

SVM

All codes and visualizations

SVM Introduction

A support vector machine (SVM) is a supervised machine learning model that uses classification algorithms for two-group classification problems.
Each SVM separates data into two classes; multi-class problems are handled by combining several binary SVMs (for example, a one-vs-rest scheme trains one SVM per class).

SVM in R

Text Data


Data Description

Text Data

The same text data is used as in the Naive Bayes section.

Analysis

R code

Polynomial SVM

svm() in R is used to run a polynomial SVM with the cost parameter equal to 1.
Here is the matrix of predicted labels and actual labels.



From this matrix, we can tell that this SVM predicted 3 of the 5 files correctly, which gives an accuracy of 60%.
This is actually better than the Naive Bayes result.

Now, let's visualize our predictions and the actual labels.



Let's try other SVMs to see if there is a better model :)


Linear SVM

Here, svm() is run with cost equal to 1 and a linear kernel.
Let's see the matrix first.



From this matrix, the prediction accuracy is actually 100%. The linear SVM is perfect for this one!

Now, let's see the two plots for predictions and actual labels.



Obviously, these two plots are identical, suggesting linear SVM might be the best model for this data.


Radial SVM

Here, svm() is run with cost equal to 10 and a radial kernel.



This matrix gives an accuracy of 40%, so the radial SVM does not work well with this data.

Now, let's see the two plots for predictions and actual labels.




Summary

After running SVMs with 3 different kernels (polynomial, linear and radial), it turns out that the linear SVM works best with the text data here.
How do we know which kernel works best?
If we don't know anything about our data, we approach it with the linear kernel first, assuming the data is linearly separable.
If the linear kernel does not work out, then we can switch to a polynomial or radial kernel after we understand more about the dataset.
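
The R analysis uses e1071::svm(); as a hedged illustration of the same kernel comparison in Python, the sketch below uses scikit-learn's SVC on a stand-in dataset (the project's own data, kernel degrees, and costs are not reproduced here).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A stand-in binary classification dataset; the project uses its own text / record data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

for kernel in ("linear", "poly", "rbf"):                   # rbf is the "radial" kernel
    clf = SVC(kernel=kernel, C=1).fit(X_train, y_train)    # C plays the role of the cost parameter
    print(kernel, "accuracy:", round(clf.score(X_test, y_test), 3))
```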

Record Data


Data Description

Record Data

The same record data is used as in the Naive Bayes section.

Analysis

Polynomial SVM

A polynomial SVM is run with cost equal to 10.
Here is the matrix of predicted labels and actual labels.



This matrix gives us an accuracy of 85.43% and this is actually slightly better than the one we did in Naive Bayes.

Let's visualize the polynomial SVM in R.



age and cigsPerDay are chosen as the axes for this graph so that the interaction between age and cigarettes smoked per day can be seen.
0 means no target and 1 means the person has heart disease.
As we can tell, most people smoke very few cigarettes per day, as most of the data points are distributed around the left part of the graph.
Also, as age increases, even if you smoke only a few cigarettes, you still have a greater chance of suffering a heart attack.
Let's explore other SVMs with different kernels.


Linear SVM

A linear SVM is run with cost equal to 10.
Here is the matrix of predicted labels and actual labels.



This matrix gives an accuracy of 84.86%, which is slightly less accurate than the polynomial SVM.

Let's visualize the linear SVM in R.



This is not a good SVM, as we can tell there is no visible classification here. The linear approach cannot classify this record data well, if at all.


Radial SVM

A radial SVM is run with cost equal to 10.
Here is the matrix of predicted labels and actual labels.



This matrix gives an accuracy of 84.30%. Among these 3 SVMs, the polynomial SVM gives the best accuracy.

Let's visualize the radial SVM in R.



This is also not a good SVM and is very similar to linear SVM. We can hardly see a classification here.


Summary

Among these 3 SVMs, the best one is the polynomial SVM, as it shows a clear classification.
However, we can see that the data is still not well classified. One way to improve this is to reduce the dimensionality of the dataset:
not all variables are useful for the analysis, and by reducing the number of columns, we might get a better classification with the SVM.
Another way is to try different costs to see if anything can be improved; see the sketch below.
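
One way to try different costs and kernels systematically is a grid search. This is a hedged sketch only: it uses synthetic data as a stand-in for the record data, and the parameter grid and scoring choice are assumptions rather than what the project actually ran. Balanced accuracy is used here because plain accuracy is misleading when roughly 85% of the rows belong to one class.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the record data; in the project, X and y would come from the Framingham dataset.
X, y = make_classification(n_samples=1000, n_features=15, weights=[0.85], random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(
    pipe,
    param_grid={"svc__kernel": ["linear", "poly", "rbf"], "svc__C": [0.1, 1, 10, 100]},
    cv=5,
    scoring="balanced_accuracy",   # plain accuracy is misleading with ~85% of one class
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```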

SVM in Python

Python code

Text Data


Data Description

Text Data

This time, one more corpus file is added. The labels now become the names of the 2 corpus files, not the file names inside them.
The new corpus file contains 11 files, which are descriptions of the datasets used in this portfolio.

Analysis

As the text data was already prepared in the Naive Bayes section, it is reused directly here.


Linear, Radial and Polynomial SVM

These 3 SVMs give exactly the same results, so I will put them together in one part.

First, let's take a look at the heat map (a better version of confusion matrix) for these 3 SVMs.



This gives an accuracy of 100%! There are 8 predictions in total and all of them are correct.
We can confirm this by looking at the predictions and the actual values separately.




We can tell that these 4 graphs are exactly the same except the color :)
In this case, all 3 SVMs work perfectly.

Let's take a look at the most important features of the data.
This visualization is only for the linear SVM, as the other kernels do not expose feature weights in the same way.



The top three features are "wa", "hi" and "s".
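
For illustration, here is a minimal sketch of how a linear SVM's learned weights can serve as feature importances. The tiny corpus and the use of LinearSVC are assumptions; the project's own vectorized text and parameters are not reproduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny illustrative corpus; in the project the matrix comes from the two corpus folders.
texts = [
    "it was the best of times", "call me ishmael",
    "heart disease mortality by county", "chest pain and blood pressure",
]
labels = ["Novels", "Novels", "corpus", "corpus"]

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = LinearSVC(C=10).fit(X, labels)

# For a linear SVM, the learned weights (coef_) act as per-word importances.
weights = clf.coef_.ravel()
top = np.argsort(np.abs(weights))[::-1][:3]
for idx in top:
    print(vec.get_feature_names_out()[idx], round(weights[idx], 3))
```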

Summary

All 3 SVMs work very well with our text data. They predicted all 8 cases right.
One way to really see which SVM works better is to increase the number of files to see which one is the best classifier.

Record Data


Data Description

Record Data

The same record data is used as in the Naive Bayes section.

Analysis


Linear SVM

A linear SVM is run in Python with cost equal to 10.

First, let's look at the heat map.



This is actually a surprise, as one normally would not expect the linear SVM to be this bad :(
From this heat map, we can see that only 247 out of more than 1,000 predictions were correct.
This gives an accuracy of only about 19%.

Let's look at the linear SVM graph.



We can immediately tell that this SVM is not a good one. From the heat map, we know that there are over 1,000 cases in class "0".
However, this graph shows most of the data points in class "1".
Clearly, the SVM with a linear kernel does not work here.
Let's explore the two other SVMs and hope they won't disappoint us.


Radial SVM

A radial SVM is run in Python with cost equal to 10.

First, let's look at the heat map.



This one is much better! Almost all cases were predicted correctly.
The accuracy for this one is 85%, which is pretty high.
One interesting thing about this SVM is that it predicted all cases to be 0, which means the high accuracy mostly reflects the fact that about 85% of the test cases belong to class 0.

Let's look at the radial SVM graph.



This actually looks reasonable. From the graph, we can tell that most of the people in the testing set are 40 to 60 years old.
Most of them smoke fewer than 20 cigarettes per day or do not smoke at all.
However, the people who are older and smoke many cigarettes per day are also classified into class "0".
This is reflected in the 189 false negative cases.


Polynomial SVM

A polynomial SVM is run in Python with cost equal to 10.

First, let's look at the heat map.



This one is almost the same as the radial SVM.
It also predicted all cases to be 0 and the accuracy is also 85%.

Let's look at the polynomial SVM graph.



The interpretation of this one is pretty much the same as for the radial SVM.


Summary

The best option here is to use the radial SVM or the polynomial SVM; they both work reasonably well with the data.
The linear SVM should not be used, as it gives a terrible classification and accuracy.

Conclusion

For the text data in R, the linear SVM is the best, with an accuracy of 100%; the polynomial SVM has an accuracy of 60% and the radial SVM has an accuracy of 40%.
In Python, all 3 SVMs work perfectly with the text data, as they all have an accuracy of 100%.
Python is the winner here!

For the record data in R, the polynomial kernel is the best option. It has the highest prediction accuracy (85.43%) and its SVM graph at least shows a clear classification.
In Python, the best options are the radial SVM and the polynomial SVM, as they both have an accuracy of 85%.
The linear SVM should be thrown away immediately because it has terrible accuracy (19%) and classification.
This time R and Python are tied.

Conclusion

After the analysis of clustering, ARM, decision trees and so on, it's time for the conclusions of this project.
First, it's necessary to take a look at the 10 questions posed in the Introduction section and see how many of them got answered.

The first question is very generic and concerns what factors are related to heart disease. Basically, all the variables in the gathered datasets have some degree of influence on the prediction of CVD, but some specific variables are crucial, as can be seen in the Decision Trees section. Blood pressure is one of the most important indicators. There are several ways to keep blood pressure at a stable level: maintaining a healthy lifestyle, doing daily exercise, keeping a balanced diet, and not staying up late are all helpful.

The second, third and fourth questions, regarding the significance of age, gender and education, can be answered using evidence from the Decision Trees and Naive Bayes sections. From the Decision Trees section, the random forest feature importance shows that age has a fairly high importance among all the variables.



In this graph, one can tell that age has a top-3 feature importance in the dataset, which means it is the third most important indicator besides blood pressure.
From the Naive Bayes section, the output of the naiveBayes() function shows the influence of gender.



In this graph, looking at the distribution of the gender variable is informative. Among the people who do not have heart disease, 41 percent are male, while 53 percent of the people who are targets are male. This indicates that males are a little more susceptible to heart disease, although the difference is small.
One possible explanation is that many males have unhealthy lifestyle habits such as smoking and drinking.
Referring back to the feature importance graph, education does not really make a significant difference when predicting heart disease. The education variable has the lowest feature importance, almost 0, and it does not even appear in the decision tree plot, as it cannot split the dataset well, if at all.

It's necessary to refer back to the decision tree plot and the feature importance plot in order to answer questions five and six.



The random forest feature importance shows that being a smoker or not is not that important in predicting whether one is a target. However, the number of cigarettes smoked per day has a degree of importance that cannot be ignored.
This tells people that whether one is labeled a smoker is not what matters; what matters is how many cigarettes are smoked. If someone smokes one cigarette per day, he or she might not become a victim of heart disease.
From the decision tree plot in R, it's not hard to see that the number 13 is pivotal: if someone smokes more than 13 cigarettes per day and has a glucose level greater than 70, he or she is likely to have heart disease.

For question number 7, the answer could be found in the decision tree analysis of the UCI dataset.



Here, cp (chest pain) has the second highest feature importance. There are several types of chest pain, and in the decision tree plot, no matter which kind of chest pain it is, it has a significant influence on the prediction.

Medical history could be classified into many categories. There are some related medical histories in the datasets, like stroke and diabetes.
Referring back to the feature importance graph of the Framingham dataset would be helpful.



In this plot, diabetes has a higher importance than stroke. This tells people that if they are diagnosed with diabetes, they need to take extra care of their health.

After learning which factors are important, it's essential to figure out how to stay healthy according to the variable importances just discussed.
First, there are two things people cannot change: one is age and the other is sex. As people get older, especially if they are male, they have to be careful; getting a periodic check-up may help minimize the risk.
Smoking is a bad habit that is hard to quit. However, if quitting is too difficult, reducing the number of cigarettes smoked per day is necessary; smoking no more than 10 cigarettes per day will reduce the risk.
If people have ever experienced any kind of chest pain, they especially need to stay alert, as chest pain can be an early symptom of a heart attack. Not staying up late and not exercising excessively are crucial.
People who are diagnosed with diabetes have a higher risk, so they have to control their diet and keep the diabetes under control in order to minimize their risk.

Health is a big issue that people cannot neglect. People have to be alert from now on, not after being diagnosed, as there is no way of going back. If you ever feel unwell, please go to the doctor. Sometimes it takes just a few seconds for a heart attack to take a life.