Source: Thinkstock

Are healthcare charges higher depending on demographic factors and smoking status?

The cost of healthcare in the United States has risen over the years. The amount an individual spends on healthcare each year potentially depends on several factors. I will analyze the Health dataset found on Kaggle to identify whether age, gender, Body Mass Index (BMI), smoking status, and region of the United States affect yearly healthcare charges.

The dataset contains 7 features and 1338 rows. I identified 7 outliers and removed them from the dataset. BMI categories were created using the BMI chart found on BMI Calculator. As displayed in image 1, the dataset contains a balanced distribution of females and males. Likewise, the distribution of BMI is balanced across both genders.
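As a rough sketch of how the BMI categories could be derived, the snippet below bins a BMI column with `pandas.cut`. The column name `bmi`, the sample values, and the cut-offs (the commonly used adult ranges) are all assumptions for illustration; the chart used for this analysis may draw the boundaries differently.

```python
import pandas as pd

# Hypothetical sample values; the column name "bmi" is an assumption.
df = pd.DataFrame({"bmi": [17.2, 22.0, 24.9, 27.5, 31.0, 40.3]})

# Assumed cut-offs (commonly used adult BMI ranges), not taken from the post.
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# right=False makes each interval [lower, upper), so a BMI of exactly 25
# falls into "Overweight" rather than "Normal".
df["bmi_category"] = pd.cut(df["bmi"], bins=bins, labels=labels, right=False)

print(df)
```

Keeping the categories as an ordered categorical column makes it straightforward to group charges by category later on.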

I performed independent t-tests on three pairs of variables: sex and charges, BMI and charges, and smoking status and charges. The p-value for charges according to sex is 0.03. Because this is below 0.05, we can reject the null hypothesis that there is no relationship between a person’s sex and insurance charges at the 5% significance level. Moreover, histogram 1, shown below, indicates that males tend to pay higher charges than females.

Likewise, the p-values for charges according to BMI and according to smoking status are both nearly 0. This suggests that the null hypothesis can be rejected in both cases: there is a significant difference in charges between BMI groups as well as between smokers and non-smokers.
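To illustrate the kind of test described here, the sketch below runs an independent two-sample t-test with `scipy.stats.ttest_ind`. The data are synthetic and generated only for this example; they are not the Kaggle values, and the group means and sizes are assumptions. I pass `equal_var=False` (Welch's variant) because the two groups are given different spreads; the original analysis may have used the default setting.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic yearly charges (NOT the real dataset): two groups, different means.
smokers = rng.normal(loc=30000, scale=8000, size=200)
non_smokers = rng.normal(loc=9000, scale=5000, size=800)

# Null hypothesis: both groups have the same mean charges.
t_stat, p_value = stats.ttest_ind(smokers, non_smokers, equal_var=False)

# Reject the null hypothesis at the 5% significance level if p < 0.05.
reject_null = p_value < 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject null: {reject_null}")
```

The same call works for sex and charges by passing the charges of males and females as the two samples.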

Image 2 shows a linear relationship between age and charges. The points form three linear clusters. Once smokers and non-smokers are identified, we can see that the first cluster (lowest charges) contains no smokers, while the second cluster contains a combination of smokers and non-smokers. Lastly, the third cluster shows that those who smoke have higher insurance charges. We can conclude that as a person’s age increases, insurance charges increase.
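One way to check this pattern numerically is to fit a separate straight line to each group. The sketch below does this with `numpy.polyfit` on synthetic data that I constructed to have the same slope for both groups and a higher baseline for smokers; the numbers are assumptions chosen for illustration, not estimates from the dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (NOT the real dataset): charges rise with age for everyone,
# and smokers start from a higher baseline.
age = rng.integers(18, 65, size=300)
smoker = rng.random(300) < 0.2
charges = 250 * age + np.where(smoker, 20000, 2000) + rng.normal(0, 1500, size=300)

# Fit charges = slope * age + intercept separately for each group.
fits = {}
for label, mask in [("non-smoker", ~smoker), ("smoker", smoker)]:
    slope, intercept = np.polyfit(age[mask], charges[mask], deg=1)
    fits[label] = (slope, intercept)
    print(f"{label}: slope = {slope:.1f} per year, intercept = {intercept:.0f}")
```

A positive slope in both groups corresponds to charges increasing with age, while the gap between the intercepts corresponds to the vertical separation between the clusters.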

Similarly, comparing BMI and charges grouped by smoking status shows that, regardless of BMI, smokers are more likely to have higher insurance charges than non-smokers. See image 3 below.

The histogram below shows that respondents who said they smoke pay higher amounts than non-smokers. Smoking status affects an individual’s annual healthcare charges.

Image 4: Boxplot of BMI categories and Charges

A normal Body Mass Index does not only suggest that we are in good physical shape; BMI can also indicate the likelihood of developing conditions that lead to visits to healthcare professionals. After analyzing the dataset, it is clear that those whose BMI is in the ‘Normal’ range tend to have lower insurance charges than those whose BMI is outside it.

We can observe in image 5 that the median charges are similar across regions. Likewise, the outliers are in the same range. Since the notches of all four regions overlap, there is no strong evidence that the true medians differ.
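For readers who want to see what the notches represent, here is a minimal sketch that computes a notch interval as median ± 1.57 × IQR / √n, a common approximation for notched box plots. The region names and the charges are synthetic assumptions for this example, and the plotting library used for image 5 may compute its notches somewhat differently.

```python
import numpy as np

def notch_interval(values):
    """Approximate notch around the median: median +/- 1.57 * IQR / sqrt(n)."""
    values = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(values.size)
    return med - half_width, med + half_width

def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

rng = np.random.default_rng(2)
# Synthetic charges per region (NOT the real dataset); region names are assumed.
regions = {name: rng.lognormal(mean=9.1, sigma=0.9, size=300)
           for name in ["northeast", "northwest", "southeast", "southwest"]}

notches = {name: notch_interval(vals) for name, vals in regions.items()}
for name, (low, high) in notches.items():
    print(f"{name}: notch from {low:.0f} to {high:.0f}")
```

Two regions whose notch intervals overlap, as checked by `intervals_overlap`, give no strong evidence of a difference in medians; non-overlapping notches would suggest that the medians differ.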

In the heatmap below, we can compare the relationships between the different features. It is clear that smoking status impacts the charges for health premiums. Age and BMI are other contributing factors. In contrast, sex, number of children, and region do not have a large influence on insurance charges.
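Assuming the heatmap is built from pairwise correlations, the numbers behind it could be produced roughly as follows. The data below are synthetic, the column names are my assumption about how the dataset is laid out, and the effect sizes are chosen only so that the example resembles the pattern described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Synthetic data (NOT the real dataset); column names are assumptions.
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "children": rng.integers(0, 6, size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
})
df["charges"] = (250 * df["age"] + 300 * df["bmi"]
                 + 20000 * (df["smoker"] == "yes")
                 + rng.normal(0, 3000, size=n))

# Encode the yes/no column as 1/0 so it can enter the correlation matrix.
encoded = df.assign(smoker=(df["smoker"] == "yes").astype(int))

# Pearson correlation of every feature with charges, strongest first.
corr_with_charges = (encoded.corr()["charges"]
                     .drop("charges")
                     .sort_values(ascending=False))
print(corr_with_charges)
```

The full matrix from `encoded.corr()` is what a heatmap function would typically receive; sorting a single column, as done here, is just a compact way to rank the features.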


  • There is a relationship between sex and charges according to the t-test performed and histogram 1. Males pay higher charges than females.
  • Body mass index has a significant relationship with insurance charges.
  • Age has a linear relationship with charges, which can also be related to other factors such as smoking status.
  • Smoking status has a significant relationship with charges. Smokers tend to have higher charges than non-smokers.
  • Charges are similar across the four mainland regions of the United States.

Here is the link to my notebook where the dataset was analyzed