Vinit Khandelwal

Posts

Showing posts from January, 2019

Breast Cancer Tumor Machine Learning Prediction Using Scikit Learn

Breast Cancer Machine Learning Prediction. Used Scikit Learn for Training, Evaluating, and Prediction. Used Seaborn and Matplotlib for Visualizing. Run this on Jupyter Notebook

    # LIBRARY IMPORTS
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    # DATASET IMPORT
    from sklearn.datasets import load_breast_cancer
    cancer = load_breast_cancer()

    # VIEW DATA
    cancer
    cancer.keys()
    print(cancer['DESCR'])
    print(cancer['target_names'])
    print(cancer['target'])
    print(cancer['feature_names'])
    print(cancer['data'])
    cancer['data'].shape
    df_cancer = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))
    df_cancer.head()
    df_cancer.tail()

    # VISUALIZE DATA
    # SEABORN PAIRPLOT
    sns.pairplot(df_cancer, hue = 'target', vars = ['mean radius', 'mean te...

Tensorflow Placeholder with Example

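Tensorflow Placeholder with Example of Calculating Area of Triangle. The area is computed from the three side lengths using Heron's formula. Run the code here: https://repl.it/@VinitKhandelwal/tensorflow-placeholder-area-triangle

    # TensorFlow 1.x API: in TensorFlow 2.x, Session and placeholder live under tf.compat.v1
    import tensorflow as tf

    def compute_area(sides):
        # sides has shape (n, 3): one row of side lengths per triangle
        a = sides[:, 0]
        b = sides[:, 1]
        c = sides[:, 2]
        s = (a + b + c) * 0.5                      # semi-perimeter
        areasq = s * (s - a) * (s - b) * (s - c)   # Heron's formula (squared area)
        return tf.sqrt(areasq)

    with tf.Session() as sess:
        # The placeholder accepts any number of triangles; values are fed at run time
        sides = tf.placeholder(tf.float32, shape=(None, 3))
        area = compute_area(sides)
        result = sess.run(area, feed_dict={sides: [[5.0, 3.0, 7.1], [2.3, 4.1, 4.8]]})
        print(result)

OUTPUT

    [6.278497 4.709139]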
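As a sanity check on the arithmetic, the same Heron's formula computation can be sketched in plain NumPy. This is only an illustration and not part of the original post; the helper name heron_area is made up for this sketch.

```python
import numpy as np

def heron_area(sides):
    """Area of each triangle in an (n, 3) array of side lengths (hypothetical helper)."""
    sides = np.asarray(sides, dtype=np.float64)
    a, b, c = sides[:, 0], sides[:, 1], sides[:, 2]
    s = (a + b + c) * 0.5  # semi-perimeter
    return np.sqrt(s * (s - a) * (s - b) * (s - c))

# Same two triangles that are fed to the placeholder in the post
areas = heron_area([[5.0, 3.0, 7.1], [2.3, 4.1, 4.8]])
print(areas)  # should agree with the TensorFlow output to about four decimals
```

Because NumPy evaluates immediately, no session or feed_dict is needed here; the placeholder version defers the same computation until sess.run supplies the values.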

Eager Execution in Tensorflow with Example

Here is how eager execution is done in Tensorflow. Run the code here: https://repl.it/@VinitKhandelwal/tensorflow-eager-area-triangle

Example: Calculating area of triangles passed as matrix and getting output with eager execution.

    # TensorFlow 1.x only: tf.contrib was removed in TensorFlow 2.x, where eager execution is the default
    import tensorflow as tf
    from tensorflow.contrib.eager.python import tfe

    tfe.enable_eager_execution()

    def compute_area(sides):
        # Compute the area of a set of triangles given by their side lengths
        a = sides[:, 0]
        b = sides[:, 1]
        c = sides[:, 2]
        s = (a + b + c) * 0.5
        areasq = s * (s - a) * (s - b) * (s - c)
        return tf.sqrt(areasq)

    # Same function as the placeholder solution, now run eagerly with the session removed
    area = compute_area(tf.constant([[5.0, 3.0, 7.1], [2.3, 4.1, 4.8]]))
    print(area)

OUTPUT

    tf.Tensor([6.278497 4.709139], shape=(2,), dtype=float32)

Data Visualization using Google Datalab, BigQuery, and Cloud Shell

Created a query to fetch the data to visualize: for each departure delay, the number of flights and the deciles of the arrival delay.

    query="""
    SELECT
      departure_delay,
      COUNT(1) AS num_flights,
      APPROX_QUANTILES(arrival_delay, 10) AS arrival_delay_deciles
    FROM
      `bigquery-samples.airline_ontime_data.flights`
    GROUP BY
      departure_delay
    HAVING
      num_flights > 100
    ORDER BY
      departure_delay ASC
    """

    import google.datalab.bigquery as bq
    df = bq.Query(query).execute().result().to_dataframe()
    df.head()

    import pandas as pd
    df['arrival_delay_deciles'].head()

    # Expand the list of deciles into one column per decile
    percentiles = df['arrival_delay_deciles'].apply(pd.Series)
    percentiles.head()
    percentiles = percentiles.rename(columns = lambda x : str(x*10) + "%")
    df = pd.concat([df['departure_delay'], percentiles], axis=1)
    df.head()

    # Drop the minimum (0%) and maximum (100%) before plotting
    without_extremes = df.drop(['0%', '100%'], axis=1)
    without_extremes.plot(x='departure_delay', xlim=(-30,50), ylim=(-50,50))
    without_extremes.plot(x='departure_delay')
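The reshaping step can be tried without BigQuery access. The sketch below uses a small made-up DataFrame (the values and the names toy and wide are assumptions, not real flight data) in which each row carries a list of 11 quantile boundaries, which is what APPROX_QUANTILES(x, 10) returns.

```python
import pandas as pd

# Hypothetical stand-in for the query result: 11 boundaries per row (0%, 10%, ..., 100%)
toy = pd.DataFrame({
    "departure_delay": [-5, 0, 5],
    "arrival_delay_deciles": [
        list(range(-20, -9)),
        list(range(-10, 1)),
        list(range(0, 11)),
    ],
})

# Expand the list column into one column per boundary, then label the columns
percentiles = pd.DataFrame(toy["arrival_delay_deciles"].tolist(), index=toy.index)
percentiles = percentiles.rename(columns=lambda i: f"{i * 10}%")

# Reattach the key column and drop the minimum (0%) and maximum (100%)
wide = pd.concat([toy["departure_delay"], percentiles], axis=1)
without_extremes = wide.drop(columns=["0%", "100%"])
print(without_extremes.columns.tolist())
```

Building the frame from a list of lists is used here instead of apply(pd.Series) only because it tends to be faster on larger results; both produce the same columns.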

10 Steps to Data Wrangling for Data Analysis using Pandas

10 Steps to Data Wrangling for Data Analysis using Pandas

Step 1: Import Pandas

    import pandas as pd

Step 2: Have a DataFrame created using pandas

    df = pd.read_csv('sample_data.csv')

Step 3: Count null values

    df.isnull().sum() # gives you count of null values in each column of the dataframe

Step 4: Plot on heatmap

    import seaborn as sns
    sns.heatmap(df.isnull(), yticklabels=False, cmap='viridis')

Step 5: Drop a column which has way too many null values

    df.drop('Column name', axis=1, inplace=True)

axis=1 is required to delete a column and not a row. inplace=True is required to update the dataframe itself instead of returning a copy.

Step 6: Drop all rows having null values

    df.dropna(inplace=True)

Step 7: Recheck dataframe

    df.isnull().sum() # should return 0 for all columns

Step 8: Change categorical-string data into columns with binary values

For example sex/gender of a person can be changed to 0s and 1s and will still make sense. Using pandas' dummies function this ...
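The steps shown above can be tried end to end on a tiny made-up DataFrame. This is a minimal sketch, not the post's own code: the column names and values are hypothetical, and pd.get_dummies is assumed to be the "dummies function" the excerpt refers to.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for sample_data.csv
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "cabin": [np.nan, np.nan, np.nan, "C85"],
    "sex": ["male", "female", "female", "male"],
})

null_counts = df.isnull().sum()   # Step 3: nulls per column
df = df.drop("cabin", axis=1)     # Step 5: drop the mostly-empty column
df = df.dropna()                  # Step 6: drop rows that still contain nulls

# Step 8: one binary column is enough for a two-valued category,
# so drop_first=True removes the redundant 'female' column
sex = pd.get_dummies(df["sex"], drop_first=True)
df = pd.concat([df.drop("sex", axis=1), sex], axis=1)
print(df)
```

Reassigning the result (df = df.drop(...)) is used here instead of inplace=True; both approaches leave df updated.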

Scikit Learn's train_test_split

Scikit Learn provides a function named "train_test_split" to divide a dataset into two parts: a train dataset and a test dataset. Here is an example to see how to use it. Run the code in Jupyter Notebook.

Note: The example uses a dataset loaded from our own csv. Without a csv with the same name and columns in your folder, the code will not work.

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

This takes the datasets X and y and randomly assigns 40% of the rows to the test dataset (test_size=0.4) and the remaining 60% to the train dataset, using the same row split for both X and y. random_state=101 fixes the shuffle so the split is reproducible. Through tuple unpacking, the 4 datasets are assigned to X_train, X_test, y_train, and y_test.

Create Model

Linear Regression Model

    X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]
    y = df['Price']
    from sklearn.linear_model import Li...
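To see the split sizes without the csv, here is a small sketch on made-up arrays (X and y below are hypothetical stand-ins for the housing features and prices). With test_size=0.4, 4 of the 10 rows go to the test set and 6 to the train set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with 2 features each, plus one target per row
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101
)

print(X_train.shape, X_test.shape)  # (6, 2) (4, 2)
print(y_train.shape, y_test.shape)  # (6,) (4,)
```

Every row lands in exactly one of the two sets, and X and y stay aligned because both are shuffled with the same permutation.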

MongoDB CRUD Operations Introduction

Create

    insertOne(data, options): inserts one record.
    insertMany(data, options): inserts multiple records at once.

Read

    find(filter, options): returns all records that match the filter.
    findOne(filter, options): returns the first record that matches the filter.

Update

    updateOne(filter, data, options): finds and updates the first record that matches the filter.
    updateMany(filter, data, options): finds and updates all records that match the filter.
    replaceOne(filter, data, options): very similar to updateOne, but replaces the whole matching record with data instead of changing individual fields.

Delete

    deleteOne(filter, options): deletes the first record that matches the filter.
    deleteMany(filter, options): deletes all records that match the filter.

Playing Around with Pandas, Matplotlib, Seaborn, Plotly, and Cufflinks Functions

Playing Around with Pandas, Matplotlib, Seaborn, Plotly, and Cufflinks Functions.

idxmin() and idxmax()

Pandas lets you get the index of the minimum and maximum values using idxmin() and idxmax() respectively.

Examples:

    dataframe.idxmin()
    dataframe.idxmax()

.std()

Pandas lets you calculate the standard deviation of each column with .std()

Examples:

    dataframe.std()

.plot()

.plot() directly shows a line graph.

.iplot()

Similar to plot() but for interactive graphs.

.xs()

.xs() stands for cross section. It is used to get the data of a sub-column. It has the following arguments:

    key - the column name or row name
    axis - 1 for columns; the default is 0 for rows
    level - the level to match key against, if columns or rows have several levels

Example:

    dataframe.xs(key='Column_name', axis=1, level='Super_column_name')

Plot Moving Average

    dataframe['Column name'].rolling(window=30).mean().plot(label='30 day moving average')
    dataframe['Column name'].plot(label='30 day moving...
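The functions above can be sketched on small made-up frames. All names and values below (df, multi, prices, and the level names Bank and Price) are hypothetical, chosen only so the results are easy to check by hand; a 3-row window replaces the 30-day window to keep the data short.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2], "B": [10, 30, 20]}, index=["x", "y", "z"])
print(df.idxmin())  # index label of the smallest value in each column
print(df.idxmax())  # index label of the largest value in each column
print(df.std())     # sample standard deviation of each column

# Two-level columns: the outer level names a bank, the inner level a price type
cols = pd.MultiIndex.from_product(
    [["BankA", "BankB"], ["Open", "Close"]], names=["Bank", "Price"]
)
multi = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=cols)

# Cross section: pick the 'Close' sub-column under every bank
close = multi.xs(key="Close", axis=1, level="Price")
print(close)

# Moving average: the first window-1 entries are NaN because the window is incomplete
prices = pd.Series([1, 2, 3, 4, 5])
moving_avg = prices.rolling(window=3).mean()
print(moving_avg)
```

Note that in xs the level argument names the level that key is matched against, so the result keeps the remaining level (here the bank names) as its columns.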

Built-in Plots in Pandas - Python

There are several plot types built-in to pandas, most of them statistical plots by nature:

    df.plot.area
    df.plot.barh
    df.plot.density
    df.plot.hist
    df.plot.line
    df.plot.scatter
    df.plot.bar
    df.plot.box
    df.plot.hexbin
    df.plot.kde
    df.plot.pie

You can also just call df.plot(kind='hist') or replace that kind argument with any of the key terms shown in the list above (e.g. 'box', 'barh', etc.)

Area Plot

    df2.plot.area()

Bar Plot

    df2.plot.bar()

Stacked Bar Plot

    df2.plot.bar(stacked=True)

Histogram Plot

    df1['A'].plot.hist()

Line Plot

    df1.plot.line(x=df1.index,y='B',figsize=(12,3),lw=1)

Scatter Plot

    df1.plot.scatter(x='A',y='B')

Scatter Plot with three variables

    df1.plot.scatter(x='A',y='B',c='C')

Scatter Plot with third variable indicated by size

    df1.plot.scatter(x='A',y='B',s=df1['C']*200)

Box Plot

    df2.plot.box()

...

Three styles of graphs using plt.style.use()

Three styles for your graphs. I have stored sample data in a csv file named df1 in the same folder where I am running this program. Run this program using Jupyter Notebook.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline

    df1 = pd.read_csv('df1', index_col=0)
    df1.head()

ggplot Style

    plt.style.use('ggplot')
    df1['A'].hist()

BMH Style

    plt.style.use('bmh')
    df1['A'].hist()

Dark Background Style

    plt.style.use('dark_background')
    df1['A'].hist()
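plt.style.use changes the style for every plot that follows. If you only want a style for one figure, a possible alternative is the plt.style.context manager. The sketch below assumes randomly generated data in place of the df1 csv and uses the non-interactive Agg backend so it also runs outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

values = np.random.default_rng(0).normal(size=200)  # hypothetical sample data
styles = ["ggplot", "bmh", "dark_background"]

for name in styles:
    # The style applies only inside the with-block and is reset afterwards
    with plt.style.context(name):
        fig, ax = plt.subplots()
        ax.hist(values, bins=30)
        ax.set_title(name)
        plt.close(fig)

# All style names that this matplotlib installation knows about
print(sorted(plt.style.available))
```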

Three Methods of Displaying a Histogram using Matplotlib and Seaborn

Three methods of displaying a histogram. I have stored sample data in a csv file named df1 in the same folder where I am running this program. Run this program using Jupyter Notebook

    import numpy as np
    import pandas as pd
    import seaborn as sns
    %matplotlib inline

    df1 = pd.read_csv('df1', index_col=0)
    df1.head()

Method 1: .hist()

    df1['A'].hist(bins=30)

Method 2: .plot(kind='hist')

    df1['A'].plot(kind='hist', bins=30)

Method 3: .plot.hist()

    df1['A'].plot.hist(bins=30)
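To compare the three methods side by side, each one can draw onto its own axes. This is a sketch under assumptions: the data is randomly generated rather than read from the df1 csv, and the Agg backend is selected so it runs outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the df1 csv
df1 = pd.DataFrame({"A": np.random.default_rng(0).normal(size=200)})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df1["A"].hist(bins=30, ax=axes[0])               # Method 1
df1["A"].plot(kind="hist", bins=30, ax=axes[1])  # Method 2
df1["A"].plot.hist(bins=30, ax=axes[2])          # Method 3

# Each bar is a patch; the bar heights on each axes add up to the number of samples
counts = [sum(p.get_height() for p in ax.patches) for ax in axes]
print(counts)
plt.close(fig)
```

The visible difference is mostly cosmetic: .hist() turns the grid on by default, while the two .plot variants label the y axis "Frequency".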

Matplotlib and Seaborn for Data Visualization - Python

Examples to learn Matplotlib and Seaborn for Data Visualization.

Install Numpy, Matplotlib, and Seaborn with the following commands on Terminal/Command Prompt

    pip install numpy OR conda install numpy
    pip install matplotlib OR conda install matplotlib
    pip install seaborn OR conda install seaborn

Run the following in Jupyter Notebook

    import matplotlib.pyplot as plt
    %matplotlib inline
    import numpy as np

    x = np.random.rand(50)
    x.sort()
    print(x)
    y = x**2
    print(y)

Plot with red line

    plt.plot(x, y, 'r-')
    plt.show()

Plot with dashed red line

    plt.plot(x, y,'r--')
    plt.show()

Plot with dots

    plt.plot(x, y,'r.')
    plt.show()

Plot with '+' sign

    plt.plot(x, y,'r+')
    plt.show()

Plot with labels of x and y and title

    plt.plot(x, y, 'r.')
    plt.xlabel('Number')
    plt.ylabel('Square')
    plt.title('y = x**2')

Plot two or more using subplot

    plt.subplot(1,2,1)
    plt.plot(x,y,'r,...