Pandas with Python

 










Hello everyone!



In this blog, we are going to learn about pandas.



So let's get started.



Course content:





  • Introduction to pandas


  • Data Structures

  • Series 

  • DataFrame

  • Re-indexing


  • Operations between Series and DataFrame


  • Sorting and Ranking


  • Descriptive statistics 


  • Data loading, storage, and file formats



Introduction to Pandas:





  • Pandas is an open-source, BSD-licensed Python library providing
    high-performance, easy-to-use data structures and data analysis
    tools for the Python programming language.



  • Python with pandas is used in a wide range of academic and
    commercial fields, including finance, economics, statistics,
    analytics, etc.



  • Fast and efficient DataFrame object with default and customized
    indexing.



  • Tools for loading data into in-memory data objects from different file
    formats. 



  • Data alignment and integrated handling of missing data.


  • Reshaping and pivoting of data sets.


  • Label-based slicing, indexing, and subsetting of large data
    sets.



  • Columns from a data structure can be deleted or inserted.


  • Group by data for aggregation and transformations.


  • High-performance merging and joining of data.
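
A few of the features listed above can be sketched in a handful of lines. The tables, column names, and values below are made up purely for illustration; this is a minimal sketch, not a complete tour of the API.

```python
import pandas as pd

# Hypothetical sales data, invented for this example only.
sales = pd.DataFrame({
    "city": ["Nagpur", "Mumbai", "Nagpur", "Mumbai"],
    "amount": [100, 250, 150, 50],
})
regions = pd.DataFrame({
    "city": ["Nagpur", "Mumbai"],
    "state": ["Maharashtra", "Maharashtra"],
})

# Group by a column and aggregate each group.
totals = sales.groupby("city")["amount"].sum()

# Join two tables on a shared key column.
merged = sales.merge(regions, on="city", how="left")

# Insert a new column, then delete it again.
merged["amount_k"] = merged["amount"] / 1000
trimmed = merged.drop(columns=["amount_k"])

print(totals)
print(trimmed)
```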



Data Structures:




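  • Pandas has historically dealt with the following three data structures:


  • Series: 1-dimensional


  • DataFrame: 2-dimensional


  • Panel: 3-dimensional (deprecated in pandas 0.20 and removed in
    0.25; a DataFrame with a MultiIndex is the usual replacement)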





These data structures are built on top of NumPy arrays, which makes them
fast.



The best way to think of these data structures is that the higher
dimensional data structure is a container of its lower-dimensional data
structure. 



For example, a DataFrame is a container of Series, and a Panel is a
container of DataFrames.
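
The container idea can be checked directly. In this small sketch (the names and values are only illustrative), pulling one column out of a DataFrame gives back a Series that shares the DataFrame's row index.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alok", "Anil"], "Age": [10, 30]})

# Selecting a single column yields a Series.
age = df["Age"]

print(type(age).__name__)          # the class name of the column object
print(age.index.equals(df.index))  # the column keeps the row labels
print(df.ndim, age.ndim)           # dimensions of the frame and the column
```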







Series:




  • A Series is a 1-dimensional, array-like structure of homogeneous data.
    It can hold data of any single type (int, float, string, Python objects,
    etc.)



  • The axis labels are collectively called the index.


  • Key points: homogeneous data, size immutable, values of data
    mutable.



  • A pandas Series can be created with the constructor:


pandas.Series(data,index,dtype,copy)


  
import pandas as pd
import numpy as np
s=pd.Series(dtype=float)
print(s)
print(type(s))
Output=Series([], dtype: float64)
<class 'pandas.core.series.Series'>

data=np.array(['a','b','c','d'])
print(data)
Output=['a' 'b' 'c' 'd']
s=pd.Series(data)
s
Output=
0 a
1 b
2 c
3 d
dtype: object

s=pd.Series(data,index=[111,222,333,444])
s
Output=
111 a
222 b
333 c
444 d
dtype: object


data={'a':0,'b':1,'c':2}
print(data)
Output={'a': 0, 'b': 1, 'c': 2}
s=pd.Series(data)
s
Output=
a 0
b 1
c 2
dtype: int64

s=pd.Series(data,index=['b','c','d','a'])
s
Output=
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

s=pd.Series(5,index=[0,1,2,3])
s
Output=
0 5
1 5
2 5
3 5
dtype: int64

s=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
s
Output=
a 1
b 2
c 3
d 4
e 5
dtype: int64

print(s.iloc[0])  # access by position; s[0] on a labelled index is deprecated in recent pandas
print(s[:3])
print(s[-3:])
print(s['a'])
print(s[['a','c','d']])
Output=
1

a 1
b 2
c 3
dtype: int64

c 3
d 4
e 5
dtype: int64

1

a 1
c 3
d 4
dtype: int64
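
Plain square brackets accept both positions and labels, which can be ambiguous. As a hedged sketch of the more explicit style, .iloc always selects by position and .loc always selects by label. The same Series as above is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

first = s.iloc[0]              # by position
by_label = s.loc["a"]          # by label
pos_slice = s.iloc[:3]         # positional slices exclude the end position
label_slice = s.loc["a":"c"]   # label slices include the end label

# Values are mutable: assign through .loc to change one in place.
s.loc["a"] = 10

print(first, by_label)
print(list(pos_slice.index), list(label_slice.index))
print(s.loc["a"])
```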





Data Operations:

s=pd.Series(np.random.randn(8))
print(s)
Output=
0 -0.263335
1 0.598104
2 2.115186
3 0.163075
4 -0.529759
5 1.268830
6 0.215765
7 1.313002
dtype: float64

s.axes
Output=[RangeIndex(start=0, stop=8, step=1)]
s2=pd.Series(np.random.randn(4),index=[11,12,13,14])
s2
Output=11 0.661457
12 0.112885
13 -0.334969
14 0.350130
dtype: float64

s2.axes
Output=[Int64Index([11, 12, 13, 14], dtype='int64')]
s.dtypes
Output=dtype('float64')
s.ndim
Output=1
s.size
Output=8
se=pd.Series(dtype=float)
print(se)
Output=Series([], dtype: float64)
se.empty
Output=True
s.empty
Output=False
s.values
Output=array([-0.26333516, 0.59810371, 2.11518637, 0.16307502, -0.52975919,
1.26882994, 0.21576517, 1.31300243])

s.head(2)
Output=
0 -0.263335
1 0.598104
dtype: float64


s.tail(2)
Output=
6 0.215765
7 1.313002
dtype: float64


s.head()
Output=
0 -0.263335
1 0.598104
2 2.115186
3 0.163075
4 -0.529759
dtype: float64


s.tail()
Output=
3 0.163075
4 -0.529759
5 1.268830
6 0.215765
7 1.313002
dtype: float64









DataFrame:




  • A dataframe is a two-dimensional data structure, i.e., data is aligned
    in a tabular fashion in rows and columns.



Features of dataframe:




  • Columns can be of different types

  • Size is mutable


  • Labeled axes (rows and columns)


  • Can perform arithmetic operations on rows and columns.
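
The features above can be seen in a short sketch. The column names and values are illustrative only and are not taken from any particular data set.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alok", "Bhushan", "Anil"], "age": [10, 20, 30]})

# Columns may hold different types.
print(df.dtypes)

# Size is mutable: add a column computed from another one, then remove it.
df["age_next_year"] = df["age"] + 1
df = df.drop(columns=["age_next_year"])

# Arithmetic can run along either axis.
df["score"] = [1.5, 2.5, 3.5]
col_sum = df[["age", "score"]].sum()        # one value per column
row_sum = df[["age", "score"]].sum(axis=1)  # one value per row

print(col_sum)
print(row_sum)
```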



A pandas DataFrame can be created with the constructor



pandas.DataFrame(data,index,columns,dtype,copy)

where data can be any of the following inputs:



  • Lists

  • Dictionary

  • Series

  • Numpy ndarrays

  • Another DataFrame




df=pd.DataFrame()
print(df)
Output=
Empty DataFrame
Columns: []
Index: []

data=[11,22,33,44,55]
df=pd.DataFrame(data)
print(df)
Output=
0
0 11
1 22
2 33
3 44
4 55

data=[["Alok",10],["Bhushan",20],["Anil",30]]
df=pd.DataFrame(data,columns=["Name","age"])
print(df)
Output=
Name age
0 Alok 10
1 Bhushan 20
2 Anil 30

data=[["Alok",10],["Bhushan",20],["Anil",30]]
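# Convert only the numeric column; passing dtype=float to the constructor
# fails on the string column in recent pandas versions.
df=pd.DataFrame(data,columns=["Name","age"]).astype({"age":float})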
print(df)
Output=
Name age
0 Alok 10.0
1 Bhushan 20.0
2 Anil 30.0

data={"Name":["Swapnil","Anil","Anup","Viraj"],
"Age":[20,21,22,24]}
print(data)
Output={'Name': ['Swapnil', 'Anil', 'Anup', 'Viraj'], 'Age': [20, 21, 22, 24]}
df=pd.DataFrame(data)
print(df)
Output=
Name Age
0 Swapnil 20
1 Anil 21
2 Anup 22
3 Viraj 24


d={'Names':pd.Series(["Praanay","Prem","Atul","Amar","Sarthak"]),"Age":pd.Series([20,25,21,22,23]),"Rating":pd.Series([2.2,2.3,5.3,1.6,4.5])}
df=pd.DataFrame(d,columns=['Names',"Age","Rating"])
print(df)
Output=
Names Age Rating
0 Praanay 20 2.2
1 Prem 25 2.3
2 Atul 21 5.3
3 Amar 22 1.6
4 Sarthak 23 4.5

print(df.T)
Output= 0 1 2 3 4
Names Praanay Prem Atul Amar Sarthak
Age 20 25 21 22 23
Rating 2.2 2.3 5.3 1.6 4.5

print(df.axes)
Output=[RangeIndex(start=0, stop=5, step=1), Index(['Names', 'Age', 'Rating'], dtype='object')]
print(df.ndim)
Output=2
print(df.shape)
Output=(5, 3)
print(df.size)
Output=15
print(df.values)
Output=
[['Praanay' 20 2.2]
['Prem' 25 2.3]
['Atul' 21 5.3]
['Amar' 22 1.6]
['Sarthak' 23 4.5]]


data=pd.DataFrame(np.arange(16).reshape(4,4),index=["Indore","Raipur","Nagpur","Hyderabad"],columns=['one','two','three','four'])
print(data)
Output=
one two three four
Indore 0 1 2 3
Raipur 4 5 6 7
Nagpur 8 9 10 11
Hyderabad 12 13 14 15





Re-Indexing:





  • A critical method on pandas objects is reindex(), which creates
    a new object with the data conformed to a new index.



  • For ordered data like time series, it may be desirable to do some
    interpolation or filling of values when reindexing.



  • The method option allows us to do this, using a method such as ffill,
    which forward-fills the values.
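
As a small sketch of forward filling (the Series and its values are invented for illustration), reindexing onto a denser index leaves gaps unless a fill method is given.

```python
import pandas as pd

obj = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Without a fill method, the new labels 1, 3 and 5 have no data.
plain = obj.reindex(range(6))

# With method="ffill", each gap takes the last valid value before it.
filled = obj.reindex(range(6), method="ffill")

print(plain)
print(filled)
```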




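# frame is not defined in the original post. This definition is an assumption
# chosen to match the output below; the first two column names are placeholders.
frame=pd.DataFrame(np.arange(9).reshape(3,3),index=['a','b','c'],columns=["Nagpur","Mumbai","Raipur"])
states=["Raipur","Indore","Hyderabad"]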
frame.reindex(columns=states)
Output=
Raipur Indore Hyderabad
a 2 NaN NaN
b 5 NaN NaN
c 8 NaN NaN

frame.reindex(index=['a','b','c','d'],columns=states)
Output=
Raipur Indore Hyderabad
a 2.0 NaN NaN
b 5.0 NaN NaN
c 8.0 NaN NaN
d NaN NaN NaN





Drop Command:





  • With the dataframe, index values can be deleted from either axis.


  • Dropping one or more entries from an axis is easy if one has an index
    array or list without those entries. As that can require a bit of set
    logic, the drop method will return a new object with the indicated value
    or values deleted from an axis.




obj=pd.Series(np.arange(5),index=['a','b','c','d','e'])
print(obj)
Output=
a 0
b 1
c 2
d 3
e 4
dtype: int32

new_obj=obj.drop('c')
new_obj
Output=
a 0
b 1
d 3
e 4
dtype: int32

new_obj=obj.drop(['c','d'])
new_obj
Output=
a 0
b 1
e 4
dtype: int32
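
The Series examples above carry over to a DataFrame, where entries can be dropped from either axis. This is a minimal sketch that reuses the city-indexed frame from the DataFrame section.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["Indore", "Raipur", "Nagpur", "Hyderabad"],
                    columns=["one", "two", "three", "four"])

# Drop rows by label (axis=0 is the default).
rows_dropped = data.drop(["Raipur", "Nagpur"])

# Drop columns by label; equivalent to data.drop(["two", "four"], axis=1).
cols_dropped = data.drop(columns=["two", "four"])

print(rows_dropped)
print(cols_dropped)
print(data.shape)  # drop returns a new object; the original is unchanged
```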





Arithmetic and data alignment:





  • One of the most important pandas features is the behavior of arithmetic
    between objects with different indexes.



  • When adding together objects, if any index pairs are not the same, the
    respective index in the result will be the union of the index
    pairs.



  • The internal data alignment introduces NaN values in the indices that
    don't overlap.



  • In the case of dataframe, alignment is performed on both the rows and
    the columns, which returns a dataframe whose index and columns are the
    unions of the ones in each dataframe.



  • Relatedly, when reindexing a series or dataframe, one can also specify
    a different fill value.
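
The alignment rules above can be sketched as follows. The labels and numbers are made up for illustration; add() with fill_value and reindex() with fill_value are the standard pandas ways to avoid the NaN values.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# The result index is the union; labels present on one side only give NaN.
total = s1 + s2

# The add method lets missing entries be treated as 0 instead.
filled = s1.add(s2, fill_value=0)

# For DataFrames, alignment happens on rows and columns at once.
df1 = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])
df2 = pd.DataFrame({"y": [10, 20], "z": [30, 40]}, index=["b", "c"])
both = df1 + df2

# reindex accepts a fill value as well.
reindexed = s1.reindex(["a", "b", "c", "d"], fill_value=0)

print(total)
print(filled)
print(both)
print(reindexed)
```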




Operations between Dataframe & Series:





  • As with NumPy arrays, arithmetic between Dataframe and series is well
    defined.



  • By default, arithmetic between Dataframe and series matches the index
    of the series on the dataframe's columns, broadcasting down the
    rows.



  • If an index value is not found in either the dataframe columns or the
    series index, the objects will be reindexed to form the union.



  • If one wants to instead broadcast over the columns, matching on the
    rows, one has to use one of the arithmetic methods (such as sub())
    and pass axis="index".




series2=pd.Series(range(3),index=['b','e','f'])
series2
Output=
b 0
e 1
f 2
dtype: int64

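# Definition reconstructed from the printed output below.
frame=pd.DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),index=['Raipur','Nagpur','hyderabad','indore'])
print(frame)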
Output=
b d e
Raipur 0 1 2
Nagpur 3 4 5
hyderabad 6 7 8
indore 9 10 11

frame+series2
Output=
b d e f
Raipur 0.0 NaN 3.0 NaN
Nagpur 3.0 NaN 6.0 NaN
hyderabad 6.0 NaN 9.0 NaN
indore 9.0 NaN 12.0 NaN

series3=frame['d']
series3
Output=
Raipur 1
Nagpur 4
hyderabad 7
indore 10
Name: d, dtype: int32

frame
Output=
b d e
Raipur 0 1 2
Nagpur 3 4 5
hyderabad 6 7 8
indore 9 10 11

frame.sub(series3,axis="index")
Output=
b d e
Raipur -1 0 1
Nagpur -1 0 1
hyderabad -1 0 1
indore -1 0 1





Function application and mapping:





  • NumPy ufuncs (element-wise array methods) work fine with pandas
    objects.



  • Another frequent operation is applying a function on 1D arrays to each
    column or row. DataFrame’s apply method does exactly this.



  • Many of the most common array statistics (like sum and mean) are
    DataFrame methods, so using apply is not necessary.



  • The function passed to apply need not return a scalar value; it can
    also return a Series with multiple values.




frame=pd.DataFrame(np.random.randn(4,3),columns=list('bde'),index=['raipur','nagpur','hyderabad','indore'])
frame
Output=
b d e
raipur 1.977048 -1.860493 0.768591
nagpur -1.498661 -2.329090 0.222861
hyderabad 0.110777 -0.467806 -0.943308
indore -0.033976 -0.147853 0.157741


np.abs
Output= <ufunc 'absolute'>

f=lambda x:x.max()-x.min()
frame.apply(f)
Output=
b 3.475709
d 2.181237
e 1.711899
dtype: float64


frame.apply(f,axis='columns')
Output=
raipur 3.837542
nagpur 2.551952
hyderabad 1.054084
indore 0.305594
dtype: float64


def f(x):
    return pd.Series([x.min(),x.max()],index=['min','max'])
frame.apply(f)
Output=
b d e
min -1.498661 -2.329090 -0.943308
max 1.977048 -0.147853 0.768591
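
To round this off, here is a brief sketch of a ufunc applied to a whole DataFrame, and of a built-in reduction standing in for apply. The frame uses fixed, invented numbers instead of random ones so the result is reproducible.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.5, -2.0], [-0.5, 4.0]],
                     index=["raipur", "nagpur"], columns=["b", "d"])

# A NumPy ufunc works element-wise and keeps the row and column labels.
absolute = np.abs(frame)

# max and min are DataFrame methods, so the range needs no apply call.
col_range = frame.max() - frame.min()
same = col_range.equals(frame.apply(lambda x: x.max() - x.min()))

print(absolute)
print(col_range)
print(same)
```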





Sorting and Ranking: 





  • Sorting a data set by some criterion is another important built-in
    operation. To sort lexicographically by row or column index, use the
    sort_index() method, which returns a new, sorted object.



  • With a dataframe, one can sort by index on either axis. The data is
    sorted in ascending order by default but can be sorted in descending
    order too.



  • To rank values, the rank methods for Series and DataFrame are the
    place to look; by default, rank breaks ties by assigning each group
    the mean rank.



  • Ranks can also be assigned according to the order they’re observed in
    the data.



  • Naturally, one can rank in descending order, too.



obj=pd.Series(range(4),index=['d','a','b','c'])
obj
Output=
d 0
a 1
b 2
c 3
dtype: int64

obj.sort_index()
Output=
a 1
b 2
c 3
d 0
dtype: int64


frame=pd.DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],columns=['d','a','b','c'])
print(frame)
Output=
d a b c
three 0 1 2 3
one 4 5 6 7


frame.sort_index()
Output=
d a b c
one 4 5 6 7
three 0 1 2 3


frame.sort_index(axis=1)
Output=
a b c d
three 1 2 3 0
one 5 6 7 4


frame=pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
frame
Output=
b a
0 4 0
1 7 1
2 -3 0
3 2 1


frame.sort_values(by='b')
Output=
b a
2 -3 0
3 2 1
0 4 0
1 7 1
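
The examples above cover sorting only, so here is a small sketch of descending sorts and of the ranking options described in the bullets. The Series values are invented for illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Sort in descending order.
descending = obj.sort_values(ascending=False)

# Default ranking: tied values share the mean of their ranks.
avg_rank = obj.rank()

# Break ties by the order in which values appear in the data.
first_rank = obj.rank(method="first")

# Rank in descending order (largest value gets the lowest rank).
desc_rank = obj.rank(ascending=False)

print(descending.tolist())
print(avg_rank.tolist())
print(first_rank.tolist())
print(desc_rank.tolist())
```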








Axis indexes with duplicate values: 




  • Up until now, all of the examples we have seen had unique axis labels
    (index values). While many pandas functions (like reindex()) require that
    the labels be unique, it is not mandatory.



  • The index's is_unique property can tell you whether its values are
    unique or not.



  • Data selection is one of the main things that behaves differently with
    duplicates. Indexing a label with multiple entries returns a Series,
    while single entries return a scalar value.




obj=pd.Series(range(5),index=['a','a','b','b','c'])
obj
Output=
a 0
a 1
b 2
b 3
c 4
dtype: int64

obj.index.is_unique
Output=False
obj['a']
Output=
a 0
a 1
dtype: int64

obj['c']
Output=4
df=pd.DataFrame(np.random.randn(4,3),index=['a','a','b','c'])
df
Output=
0 1 2
a -0.948989 -0.236842 1.203461
a -1.186551 0.934325 -1.282523
b 0.679511 -1.089725 1.387880
c 0.743163 -0.895804 0.361094


df.loc['b']
Output=
0 0.679511
1 -1.089725
2 1.387880
Name: b, dtype: float64


df.loc['a']
Output=
0 1 2
a -0.948989 -0.236842 1.203461
a -1.186551 0.934325 -1.282523




Descriptive statistics with pandas:




  • Pandas objects are equipped with a set of common mathematical and
    statistical methods. Most of these fall into the category of reductions
    or summary statistics, methods that extract a single value (like the sum
    or mean) from a Series or a Series of values from the rows or columns of
    a DataFrame. Compared with the equivalent methods of NumPy arrays, they
    are all built from the ground up to exclude missing data.



  • NA values are excluded unless the entire slice is NA. This can be
    disabled using the skipna option.




df=pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index = ['a','b','c','d'],columns=['one','two'])
df
Output=
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3


df.sum()
Output=
one 9.25
two -5.80
dtype: float64


df.sum(axis='columns')
Output=
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64


df.mean(axis='columns',skipna=False)
Output=
a NaN
b 1.300
c NaN
d -0.275
dtype: float64


df.cumsum()
Output=
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8


df.describe()
Output=
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000


obj=pd.Series(['a','a','b','c']*4)
obj
Output=
0 a
1 a
2 b
3 c
4 a
5 a
6 b
7 c
8 a
9 a
10 b
11 c
12 a
13 a
14 b
15 c
dtype: object


obj.describe()
Output=count 16
unique 3
top a
freq 8
dtype: object




Data loading, storage, and file formats:




  • The tools & libraries for data analysis are of little use if one
    can’t easily import and export data in Python. We will be focused on
    input and output with pandas objects, though there are of course
    numerous tools in other libraries to aid in this process.



  • Input and output typically fall into a few main categories:


  • Reading text files and other more efficient on-disk formats


  • Loading data from databases 


  • Interacting with network sources like web APIs.

    A txt file is provided below for practice; in the future you will also have to work with databases.



  • Download txt file (temp)



df=pd.read_csv("temp.txt")
df
Output=
S.No Name Age City Salary DOB
0 1 Vishal NaN Nagpur 20000 22-12-1998
1 2 Pranay 32.0 Mumbai 3000 23-02-1991
2 3 Akshay 43.0 Banglore 8300 12-05-1985
3 4 Ram 38.0 Hyderabad 3900 01-12-1992


print(df.shape)
Output=(4, 6)
df=pd.read_csv("temp.txt",usecols=["Name","Age"])
df
Output=
Name Age
0 Vishal NaN
1 Pranay 32.0
2 Akshay 43.0
3 Ram 38.0


df=pd.read_csv("temp.txt",index_col=['S.No'])
df
Output=
S.No Name Age City Salary DOB

1 Vishal NaN Nagpur 20000 22-12-1998
2 Pranay 32.0 Mumbai 3000 23-02-1991
3 Akshay 43.0 Banglore 8300 12-05-1985
4 Ram 38.0 Hyderabad 3900 01-12-1992


df.dtypes
Output=
Name object
Age float64
City object
Salary int64
DOB object
dtype: object


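date_cols=['DOB']
# The dates are written day-first (DD-MM-YYYY), so say so explicitly;
# otherwise ambiguous values such as 12-05-1985 may be read month-first.
df=pd.read_csv('temp.txt',parse_dates=date_cols,dayfirst=True)
df
Output=
S.No Name Age City Salary DOB
0 1 Vishal NaN Nagpur 20000 1998-12-22
1 2 Pranay 32.0 Mumbai 3000 1991-02-23
2 3 Akshay 43.0 Banglore 8300 1985-05-12
3 4 Ram 38.0 Hyderabad 3900 1992-12-01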


df['DOB'].dt.year
Output=0 1998
1 1991
2 1985
3 1992
Name: DOB, dtype: int64


df=pd.read_csv('temp.txt',names=['a','b','c','d','e','f'])
df
Output=
a b c d e f
0 S.No Name Age City Salary DOB
1 1 Vishal NaN Nagpur 20000 22-12-1998
2 2 Pranay 32 Mumbai 3000 23-02-1991
3 3 Akshay 43 Banglore 8300 12-05-1985
4 4 Ram 38 Hyderabad 3900 01-12-1992



df=pd.read_csv('temp.txt',names=['a','b','c','d','e','f'],header=0)
df
Output=
a b c d e f
0 1 Vishal NaN Nagpur 20000 22-12-1998
1 2 Pranay 32.0 Mumbai 3000 23-02-1991
2 3 Akshay 43.0 Banglore 8300 12-05-1985
3 4 Ram 38.0 Hyderabad 3900 01-12-1992


df=pd.read_csv('temp.txt',skiprows=2,names=['a','b','c','d','e','f'],header=0)
df
Output=
a b c d e f
0 3 Akshay 43 Banglore 8300 12-05-1985
1 4 Ram 38 Hyderabad 3900 01-12-1992


df=pd.read_csv("temp.txt")
df.loc[0,'Age']=21
df
Output=
S.No Name Age City Salary DOB
0 1 Vishal 21.0 Nagpur 20000 22-12-1998
1 2 Pranay 32.0 Mumbai 3000 23-02-1991
2 3 Akshay 43.0 Banglore 8300 12-05-1985
3 4 Ram 38.0 Hyderabad 3900 01-12-1992
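
If temp.txt is not at hand, the same ideas can be tried on an in-memory file. In the sketch below, the text is an assumed reconstruction of the file from the printed output above (the real file may differ). It also shows writing a frame back out with to_csv and reading it in again.

```python
import io
import pandas as pd

# Assumed file contents, reconstructed from the output shown above.
text = """S.No,Name,Age,City,Salary,DOB
1,Vishal,,Nagpur,20000,22-12-1998
2,Pranay,32,Mumbai,3000,23-02-1991
3,Akshay,43,Banglore,8300,12-05-1985
4,Ram,38,Hyderabad,3900,01-12-1992
"""

# read_csv accepts any file-like object, not only a path.
df = pd.read_csv(io.StringIO(text), index_col="S.No")

# Parse the day-first dates with an explicit format.
df["DOB"] = pd.to_datetime(df["DOB"], format="%d-%m-%Y")

# Write the frame to CSV text and read it back.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
round_trip = pd.read_csv(buffer, index_col="S.No", parse_dates=["DOB"])

print(df)
print(round_trip.dtypes)
```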





In the same way, you can perform these operations on different files.
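
Databases follow a similar pattern. As a minimal sketch, assuming an in-memory SQLite database and a hypothetical table named people (neither comes from the practice file), a DataFrame can be written with to_sql and queried back with read_sql_query.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table contents, invented for illustration.
people = pd.DataFrame({
    "Name": ["Pranay", "Akshay", "Ram"],
    "Age": [32, 43, 38],
})

# Store the frame as a table, then load a filtered view of it.
people.to_sql("people", conn, index=False)
older = pd.read_sql_query(
    "SELECT Name, Age FROM people WHERE Age > 35 ORDER BY Age", conn
)
conn.close()

print(older)
```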



This is a vast topic, so try to understand it and practise
regularly.


Best regards from,


msbtenotes:)


THANK YOU!!!

