Here is how to play with missing data and use groupby function in Pandas.
Run the code here: https://repl.it/@VinitKhandelwal/pandas-groupby
{'A': [1, 2, nan], 'B': [4, nan, nan], 'C': [7, 8, 9]}
A B C
0 1.0 4.0 7
1 2.0 NaN 8
2 NaN NaN 9
DROP ROWS
A B C
0 1.0 4.0 7
DROP COLUMNS
C
0 7
1 8
2 9
DROP ROWS WITH THRESHHOLD
A B C
0 1.0 4.0 7
1 2.0 NaN 8
A B C
0 1 4 7
1 2 Fill Value 8
2 Fill Value Fill Value 9
A B C
0 1.0 4.0 7
1 2.0 4.0 8
2 1.5 4.0 9
GROUP BY
{'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'], 'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'], 'Sales': [200, 120, 340, 124, 243, 350]}
Company Person Sales
0 GOOG Sam 200
1 GOOG Charlie 120
2 MSFT Amy 340
3 MSFT Vanessa 124
4 FB Carl 243
5 FB Sarah 350
Sales
Company
FB 296.5
GOOG 160.0
MSFT 232.0
Sales
Company
FB 75.660426
GOOG 56.568542
MSFT 152.735065
Sales
Company
FB 593
GOOG 320
MSFT 464
Sales 320
Name: GOOG, dtype: int64
Sales 320
Name: GOOG, dtype: int64
Person Sales
Company
FB 2 2
GOOG 2 2
MSFT 2 2
Person Sales
Company
FB Sarah 350
GOOG Sam 200
MSFT Vanessa 340
Person Sales
Company
FB Carl 243
GOOG Charlie 120
MSFT Amy 124
Sales
count mean std min 25% 50% 75% max
Company
FB 2.0 296.5 75.660426 243.0 269.75 296.5 323.25 350.0
GOOG 2.0 160.0 56.568542 120.0 140.00 160.0 180.00 200.0
MSFT 2.0 232.0 152.735065 124.0 178.00 232.0 286.00 340.0
Company FB GOOG MSFT
Sales count 2.000000 2.000000 2.000000
mean 296.500000 160.000000 232.000000
std 75.660426 56.568542 152.735065
min 243.000000 120.000000 124.000000
25% 269.750000 140.000000 178.000000
50% 296.500000 160.000000 232.000000
75% 323.250000 180.000000 286.000000
max 350.000000 200.000000 340.000000
Sales count 2.000000
mean 296.500000
std 75.660426
min 243.000000
25% 269.750000
50% 296.500000
75% 323.250000
max 350.000000
Name: FB, dtype: float64
Run the code here: https://repl.it/@VinitKhandelwal/pandas-groupby
import numpy as np
import pandas as pd
print("MISSING DATA")
d = {'A':[1,2,np.nan],'B':[4,np.nan,np.nan],'C':[7,8,9]}
print(d)
df = pd.DataFrame(d)
print(df)
print("DROP ROWS")
print(df.dropna())
print("DROP COLUMNS")
print(df.dropna(axis=1))
print("DROP ROWS WITH THRESHHOLD")
print(df.dropna(thresh=2))
print(df.fillna(value="Fill Value"))
print(df.fillna(value=df.mean()))
print("GROUP BY")
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'], 'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'], 'Sales':[200,120,340,124,243,350]}
print(data)
df = pd.DataFrame(data)
print(df)
by_comp = df.groupby('Company')
print(by_comp.mean())
print(by_comp.std())
print(by_comp.sum())
print(by_comp.sum().loc['GOOG'])
print(df.groupby('Company').sum().loc['GOOG'])
print(df.groupby('Company').count())
print(df.groupby('Company').max())
print(df.groupby('Company').min())
print(df.groupby('Company').describe())
print(df.groupby('Company').describe().transpose())
print(df.groupby('Company').describe().transpose()['FB'])
OUTPUT
MISSING DATA{'A': [1, 2, nan], 'B': [4, nan, nan], 'C': [7, 8, 9]}
A B C
0 1.0 4.0 7
1 2.0 NaN 8
2 NaN NaN 9
DROP ROWS
A B C
0 1.0 4.0 7
DROP COLUMNS
C
0 7
1 8
2 9
DROP ROWS WITH THRESHHOLD
A B C
0 1.0 4.0 7
1 2.0 NaN 8
A B C
0 1 4 7
1 2 Fill Value 8
2 Fill Value Fill Value 9
A B C
0 1.0 4.0 7
1 2.0 4.0 8
2 1.5 4.0 9
GROUP BY
{'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'], 'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'], 'Sales': [200, 120, 340, 124, 243, 350]}
Company Person Sales
0 GOOG Sam 200
1 GOOG Charlie 120
2 MSFT Amy 340
3 MSFT Vanessa 124
4 FB Carl 243
5 FB Sarah 350
Sales
Company
FB 296.5
GOOG 160.0
MSFT 232.0
Sales
Company
FB 75.660426
GOOG 56.568542
MSFT 152.735065
Sales
Company
FB 593
GOOG 320
MSFT 464
Sales 320
Name: GOOG, dtype: int64
Sales 320
Name: GOOG, dtype: int64
Person Sales
Company
FB 2 2
GOOG 2 2
MSFT 2 2
Person Sales
Company
FB Sarah 350
GOOG Sam 200
MSFT Vanessa 340
Person Sales
Company
FB Carl 243
GOOG Charlie 120
MSFT Amy 124
Sales
count mean std min 25% 50% 75% max
Company
FB 2.0 296.5 75.660426 243.0 269.75 296.5 323.25 350.0
GOOG 2.0 160.0 56.568542 120.0 140.00 160.0 180.00 200.0
MSFT 2.0 232.0 152.735065 124.0 178.00 232.0 286.00 340.0
Company FB GOOG MSFT
Sales count 2.000000 2.000000 2.000000
mean 296.500000 160.000000 232.000000
std 75.660426 56.568542 152.735065
min 243.000000 120.000000 124.000000
25% 269.750000 140.000000 178.000000
50% 296.500000 160.000000 232.000000
75% 323.250000 180.000000 286.000000
max 350.000000 200.000000 340.000000
Sales count 2.000000
mean 296.500000
std 75.660426
min 243.000000
25% 269.750000
50% 296.500000
75% 323.250000
max 350.000000
Name: FB, dtype: float64
Comments
Post a Comment