Hello everyone!
In this blog, we are going to learn about pandas.
So let's get started.
Course content:
Introduction to pandas
Data Structures
- Series
- DataFrame
- Re-indexing
Operations between Series and DataFrame
Sorting and Ranking
Descriptive statistics
Data loading, storage, and file formats
Introduction to Pandas:
Pandas is an open-source, BSD-licensed Python library providing
high-performance, easy-to-use data structures and data analysis
tools for the Python programming language.
Python with Pandas is used in a wide range of academic and commercial
domains, including finance, economics, statistics, and analytics.
Key features of pandas:
- Fast and efficient DataFrame object with default and customized
indexing.
- Tools for loading data into in-memory data objects from different file
formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing, and subsetting of large data
sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High-performance merging and joining of data.
Data Structures:
Pandas originally provided the following three data structures:
Series:- 1 Dimensional
DataFrame:- 2 Dimensional
Panel:- 3 Dimensional (deprecated in pandas 0.20 and removed in 0.25, so
current versions offer only Series and DataFrame)
These data structures are built on top of NumPy arrays, which means they
are fast.
The best way to think of these data structures is that the higher
dimensional data structure is a container of its lower-dimensional data
structure.
For example, a DataFrame is a container of Series, and a Panel was a
container of DataFrames.
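The container idea can be checked directly. This is a minimal sketch with made-up data (the column names are only for illustration): pulling one column out of a DataFrame gives back a Series that shares the DataFrame's row index.

```python
import pandas as pd

# Illustrative data; the names and values are made up for this sketch.
df = pd.DataFrame({"Name": ["Alok", "Anil"], "Age": [10, 30]})

# Selecting a single column returns a Series.
col = df["Age"]
print(type(col))

# The Series carries the same row labels as the DataFrame it came from.
print(col.index.equals(df.index))
```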
Series:
Series is a 1-dimensional array-like structure with homogeneous data,
capable of holding data of any type (int, float, string, Python objects,
etc.)
The axis labels are collectively called the index.
Key points: Homogeneous data, Size immutable, Values of data
mutable.
A pandas Series can be created by using the constructor:
pandas.Series(data, index, dtype, copy)
import pandas as pd
import numpy as np
s=pd.Series(dtype=float)
print(s)
print(type(s))
Output=Series([], dtype: float64)
<class 'pandas.core.series.Series'>
data=np.array(['a','b','c','d'])
print(data)
Output=['a' 'b' 'c' 'd']
s=pd.Series(data)
s
Output=
0 a
1 b
2 c
3 d
dtype: object
s=pd.Series(data,index=[111,222,333,444])
s
Output=
111 a
222 b
333 c
444 d
dtype: object
data={'a':0,'b':1,'c':2}
print(data)
Output={'a': 0, 'b': 1, 'c': 2}
s=pd.Series(data)
s
Output=
a 0
b 1
c 2
dtype: int64
s=pd.Series(data,index=['b','c','d','a'])
s
Output=
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
s=pd.Series(5,index=[0,1,2,3])
s
Output=
0 5
1 5
2 5
3 5
dtype: int64
s=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
s
Output=
a 1
b 2
c 3
d 4
e 5
dtype: int64
print(s.iloc[0])  # positional access; s[0] on a labelled index is deprecated in recent pandas
print(s[:3])
print(s[-3:])
print(s['a'])
print(s[['a','c','d']])
Output=
1
a 1
b 2
c 3
dtype: int64
c 3
d 4
e 5
dtype: int64
1
a 1
c 3
d 4
dtype: int64
Data Operations
s=pd.Series(np.random.randn(8))
print(s)
Output=
0 -0.263335
1 0.598104
2 2.115186
3 0.163075
4 -0.529759
5 1.268830
6 0.215765
7 1.313002
dtype: float64
s.axes
Output=[RangeIndex(start=0, stop=8, step=1)]
s2=pd.Series(np.random.randn(4),index=[11,12,13,14])
s2
Output=11 0.661457
12 0.112885
13 -0.334969
14 0.350130
dtype: float64
s2.axes
Output=[Int64Index([11, 12, 13, 14], dtype='int64')]
s.dtypes
Output=dtype('float64')
s.ndim
Output=1
s.size
Output=8
se=pd.Series(dtype=float)
print(se)
Output=Series([], dtype: float64)
se.empty
Output=True
s.empty
Output=False
s.values
Output=array([-0.26333516, 0.59810371, 2.11518637, 0.16307502, -0.52975919,
1.26882994, 0.21576517, 1.31300243])
s.head(2)
Output=
0 -0.263335
1 0.598104
dtype: float64
s.tail(2)
Output=
6 0.215765
7 1.313002
dtype: float64
s.head()
Output=
0 -0.263335
1 0.598104
2 2.115186
3 0.163075
4 -0.529759
dtype: float64
s.tail()
Output=
3 0.163075
4 -0.529759
5 1.268830
6 0.215765
7 1.313002
dtype: float64
DataFrame:
A dataframe is a two-dimensional data structure, i.e., data is aligned
in a tabular fashion in rows and columns.
Features of DataFrame:
- Columns can potentially be of different types
- Size is mutable
- Labeled axes (rows and columns)
- Can perform arithmetic operations on rows and columns
A pandas DataFrame can be created by using the constructor:
pandas.DataFrame(data, index, columns, dtype, copy)
The data can come from various inputs, such as:
- Lists
- Dictionary
- Series
- NumPy ndarrays
- Another DataFrame
df=pd.DataFrame()
print(df)
Output=
Empty DataFrame
Columns: []
Index: []
data=[11,22,33,44,55]
df=pd.DataFrame(data)
print(df)
Output=
0
0 11
1 22
2 33
3 44
4 55
data=[["Alok",10],["Bhushan",20],["Anil",30]]
df=pd.DataFrame(data,columns=["Name","age"])
print(df)
Output=
Name age
0 Alok 10
1 Bhushan 20
2 Anil 30
data=[["Alok",10],["Bhushan",20],["Anil",30]]
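# Cast only the numeric column; passing dtype=float to the constructor raises in pandas 2.x because "Name" holds strings
df=pd.DataFrame(data,columns=["Name","age"]).astype({"age":float})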
print(df)
Output=
Name age
0 Alok 10.0
1 Bhushan 20.0
2 Anil 30.0
data={"Name":["Swapnil","Anil","Anup","Viraj"],
"Age":[20,21,22,24]}
print(data)
Output={'Name': ['Swapnil', 'Anil', 'Anup', 'Viraj'], 'Age': [20, 21, 22, 24]}
df=pd.DataFrame(data)
print(df)
Output=
Name Age
0 Swapnil 20
1 Anil 21
2 Anup 22
3 Viraj 24
d={'Names':pd.Series(["Praanay","Prem","Atul","Amar","Sarthak"]),"Age":pd.Series([20,25,21,22,23]),"Rating":pd.Series([2.2,2.3,5.3,1.6,4.5])}
df=pd.DataFrame(d,columns=['Names',"Age","Rating"])
print(df)
Output=
Names Age Rating
0 Praanay 20 2.2
1 Prem 25 2.3
2 Atul 21 5.3
3 Amar 22 1.6
4 Sarthak 23 4.5
print(df.T)
Output= 0 1 2 3 4
Names Praanay Prem Atul Amar Sarthak
Age 20 25 21 22 23
Rating 2.2 2.3 5.3 1.6 4.5
print(df.axes)
Output=[RangeIndex(start=0, stop=5, step=1), Index(['Names', 'Age', 'Rating'], dtype='object')]
print(df.ndim)
Output=2
print(df.shape)
Output=(5, 3)
print(df.size)
Output=15
print(df.values)
Output=
[['Praanay' 20 2.2]
['Prem' 25 2.3]
['Atul' 21 5.3]
['Amar' 22 1.6]
['Sarthak' 23 4.5]]
data=pd.DataFrame(np.arange(16).reshape(4,4),index=["Indore","Raipur","Nagpur","Hyderabad"],columns=['one','two','three','four'])
print(data)
Output=
one two three four
Indore 0 1 2 3
Raipur 4 5 6 7
Nagpur 8 9 10 11
Hyderabad 12 13 14 15
Re-Indexing:
A critical method on pandas objects is reindex(), which creates a new
object with the data conformed to a new index. Labels that were not
present in the original object are filled with NaN.
For ordered data like time series, it may be desirable to do some
interpolation or filling of values when reindexing.
The method option allows us to do this, using a method such as ffill,
which forward-fills the values.
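# Assumed setup: a 3x3 frame whose third column is 'Raipur' (the other two column names are placeholders)
frame=pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','b','c'],columns=['Nagpur','Mumbai','Raipur'])
states=["Raipur","Indore","Hyderabad"]
frame.reindex(columns=states)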
Output=
Raipur Indore Hyderabad
a 2 NaN NaN
b 5 NaN NaN
c 8 NaN NaN
frame.reindex(index=['a','b','c','d'],columns=states)
Output=
Raipur Indore Hyderabad
a 2.0 NaN NaN
b 5.0 NaN NaN
c 8.0 NaN NaN
d NaN NaN NaN
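The forward-fill option described above is not exercised by the examples in this section, so here is a minimal sketch of it. The Series and its labels are hypothetical; the only pandas call involved is reindex() with method="ffill".

```python
import pandas as pd

# Hypothetical ordered data with gaps in the index (labels 0, 2 and 4 only).
obj = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Reindex onto 0..5; method="ffill" carries the last seen value forward
# into the labels that were missing from the original index.
filled = obj.reindex(range(6), method="ffill")
print(filled)
```

Without the method argument, labels 1, 3 and 5 would hold NaN instead.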
Drop Command:
Dropping one or more entries from an axis is easy if one already has an
index array or list without those entries. As building that can require a
bit of set logic, the drop method returns a new object with the indicated
value or values deleted from an axis.
With a DataFrame, index values can be deleted from either axis.
obj=pd.Series(np.arange(5),index=['a','b','c','d','e'])
print(obj)
Output=
a 0
b 1
c 2
d 3
e 4
dtype: int32
new_obj=obj.drop('c')
new_obj
Output=
a 0
b 1
d 3
e 4
dtype: int32
new_obj=obj.drop(['c','d'])
new_obj
Output=
a 0
b 1
e 4
dtype: int32
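The Series examples above cover only one axis. As a rough sketch of dropping from either axis of a DataFrame, the block below reuses the city-indexed frame built earlier in this post; the variable names rows_dropped and cols_dropped are just for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["Indore", "Raipur", "Nagpur", "Hyderabad"],
                    columns=["one", "two", "three", "four"])

# Drop row labels (axis 0 is the default).
rows_dropped = data.drop(["Raipur", "Nagpur"])

# Drop column labels by naming them with the columns keyword.
cols_dropped = data.drop(columns=["two", "four"])

print(rows_dropped)
print(cols_dropped)

# drop returns new objects, so the original frame is unchanged.
print(data.shape)
```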
Arithmetic and data alignment:
One of the most important pandas features is the behavior of arithmetic
between objects with different indexes.
When adding objects together, if any index pairs are not the same, the
index of the result will be the union of the two indexes.
The internal data alignment introduces NaN values at the labels that
don't overlap.
In the case of DataFrames, alignment is performed on both the rows and
the columns, which returns a DataFrame whose index and columns are the
unions of the ones in each DataFrame.
Relatedly, when reindexing a Series or DataFrame, one can also specify
a different fill value instead of NaN.
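A small sketch of the alignment rules above, using two made-up Series whose indexes only partly overlap. It also shows the fill_value argument of the add() method and of reindex(), which replaces the NaN that alignment would otherwise introduce.

```python
import pandas as pd

# Made-up values; only labels "b" and "c" appear in both Series.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Plain addition: the result index is the union; "a" and "d" become NaN.
total = s1 + s2
print(total)

# The add method with fill_value treats a missing entry as 0 before adding.
filled = s1.add(s2, fill_value=0)
print(filled)

# reindex accepts a fill value too.
reindexed = s1.reindex(["a", "b", "c", "d"], fill_value=0)
print(reindexed)
```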
Operations between Dataframe & Series:
As with NumPy arrays, arithmetic between a DataFrame and a Series is well
defined.
By default, arithmetic between a DataFrame and a Series matches the index
of the Series on the DataFrame's columns, broadcasting down the
rows.
If an index value is not found in either the DataFrame's columns or the
Series' index, the objects will be reindexed to form the union.
If one wants to instead broadcast over the columns, matching on the
rows, one has to use one of the arithmetic methods (such as sub) and pass
axis="index".
series2=pd.Series(range(3),index=['b','e','f'])
series2
Output=
b 0
e 1
f 2
dtype: int64
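frame=pd.DataFrame(np.arange(12).reshape((4,3)),columns=list('bde'),index=['Raipur','Nagpur','hyderabad','indore'])
print(frame)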
Output=
b d e
Raipur 0 1 2
Nagpur 3 4 5
hyderabad 6 7 8
indore 9 10 11
frame+series2
Output=
b d e f
Raipur 0.0 NaN 3.0 NaN
Nagpur 3.0 NaN 6.0 NaN
hyderabad 6.0 NaN 9.0 NaN
indore 9.0 NaN 12.0 NaN
series3=frame['d']
series3
Output=
Raipur 1
Nagpur 4
hyderabad 7
indore 10
Name: d, dtype: int32
frame
Output=
b d e
Raipur 0 1 2
Nagpur 3 4 5
hyderabad 6 7 8
indore 9 10 11
frame.sub(series3,axis="index")
Output=
b d e
Raipur -1 0 1
Nagpur -1 0 1
hyderabad -1 0 1
indore -1 0 1
Function application and mapping:
NumPy ufuncs (element-wise array methods) work fine with pandas
objects.
Another frequent operation is applying a function on 1D arrays to each
column or row. DataFrame’s apply method does exactly this.
Many of the most common array statistics (like sum and mean) are
DataFrame methods, so using apply is not necessary.
The function passed to apply need not return a scalar value; it can
also return a Series with multiple values.
frame=pd.DataFrame(np.random.randn(4,3),columns=list('bde'),index=['raipur','nagpur','hyderabad','indore'])
frame
Output=
b d e
raipur 1.977048 -1.860493 0.768591
nagpur -1.498661 -2.329090 0.222861
hyderabad 0.110777 -0.467806 -0.943308
indore -0.033976 -0.147853 0.157741
np.abs
Output=<ufunc 'absolute'>
f=lambda x:x.max()-x.min()
frame.apply(f)
Output=
b 3.475709
d 2.181237
e 1.711899
dtype: float64
frame.apply(f,axis='columns')
Output=
raipur 3.837542
nagpur 2.551952
hyderabad 1.054084
indore 0.305594
dtype: float64
def f(x):
    return pd.Series([x.min(),x.max()],index=['min','max'])
frame.apply(f)
Output=
b d e
min -1.498661 -2.329090 -0.943308
max 1.977048 -0.147853 0.768591
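Two of the points at the start of this section are stated but not run: that a ufunc can be applied to a whole DataFrame, and that built-in statistics make apply unnecessary for common cases. A minimal sketch, using a small hand-written frame (named small here to keep it separate from the random frame above):

```python
import numpy as np
import pandas as pd

# Hand-written values so the results are easy to check by eye.
small = pd.DataFrame([[1.5, -2.0], [-0.5, 4.0]],
                     index=["raipur", "nagpur"], columns=["b", "d"])

# A NumPy ufunc applied to the frame works element-wise and keeps the labels.
abs_small = np.abs(small)
print(abs_small)

# The built-in sum gives the same column totals as apply with a lambda.
col_sums = small.sum()
via_apply = small.apply(lambda x: x.sum())
print(col_sums)
print(via_apply)
```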
Sorting and Ranking:
Sorting a data set by some criterion is another important built-in
operation. To sort lexicographically by row or column index, use the
sort_index() method, which returns a new, sorted object. To sort by the
values themselves, use sort_values().
With a DataFrame, one can sort by index on either axis. The data is
sorted in ascending order by default but can be sorted in descending
order too, by passing ascending=False.
Ranking assigns ranks from one up to the number of valid data points.
The rank methods of Series and DataFrame are the place to look; by
default, rank breaks ties by assigning each group the mean rank.
Ranks can also be assigned according to the order they’re observed in
the data (method='first').
Naturally, one can rank in descending order, too.
obj=pd.Series(range(4),index=['d','a','b','c'])
obj
Output=
d 0
a 1
b 2
c 3
dtype: int64
obj.sort_index()
Output=
a 1
b 2
c 3
d 0
dtype: int64
frame=pd.DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],columns=['d','a','b','c'])
print(frame)
Output=
d a b c
three 0 1 2 3
one 4 5 6 7
frame.sort_index()
Output=
d a b c
one 4 5 6 7
three 0 1 2 3
frame.sort_index(axis=1)
Output=
a b c d
three 1 2 3 0
one 5 6 7 4
frame=pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
frame
Output=
b a
0 4 0
1 7 1
2 -3 0
3 2 1
frame.sort_values(by='b')
Output=
b a
2 -3 0
3 2 1
0 4 0
1 7 1
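The examples above cover sorting only, so here is a short sketch of the ranking behaviour described at the start of this section, plus a descending sort. The values are made up and contain ties on purpose.

```python
import pandas as pd

# Made-up values with two tied pairs (7, 7) and (4, 4).
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default: tied values share the mean of the ranks they span.
mean_rank = obj.rank()
print(mean_rank)

# method="first": ties are broken by order of appearance in the data.
first_rank = obj.rank(method="first")
print(first_rank)

# ascending=False: the largest value gets rank 1.
desc_rank = obj.rank(ascending=False)
print(desc_rank)

# Sorting by value in descending order.
sorted_desc = obj.sort_values(ascending=False)
print(sorted_desc)
```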
Axis indexes with duplicate values:
Up until now, all of the examples we have seen had unique axis labels
(index values). While many pandas functions (like reindex()) require that
the labels be unique, it’s not mandatory.
The index's is_unique property can tell you whether its values are
unique or not.
Data selection is one of the main things that behaves differently with
duplicates. Indexing a label with multiple entries returns a Series, while
a label with a single entry returns a scalar value.
obj=pd.Series(range(5),index=['a','a','b','b','c'])
obj
Output=
a 0
a 1
b 2
b 3
c 4
dtype: int64
obj.index.is_unique
Output=False
obj['a']
Output=
a 0
a 1
dtype: int64
obj['c']
Output=4
df=pd.DataFrame(np.random.randn(4,3),index=['a','a','b','c'])
df
Output=
0 1 2
a -0.948989 -0.236842 1.203461
a -1.186551 0.934325 -1.282523
b 0.679511 -1.089725 1.387880
c 0.743163 -0.895804 0.361094
df.loc['b']
Output=
0 0.679511
1 -1.089725
2 1.387880
Name: b, dtype: float64
df.loc['a']
Output=
0 1 2
a -0.948989 -0.236842 1.203461
a -1.186551 0.934325 -1.282523
Descriptive statistics with pandas:
Pandas objects are equipped with a set of common mathematical and
statistical methods. Most of these fall into the category of reductions
or summary statistics, methods that extract a single value (like the sum
or mean) from a Series or a Series of values from the rows or columns of
a DataFrame. Compared with the equivalent methods of NumPy arrays, they
are all built from the ground up to exclude missing data.
NA values are excluded by default. When an entire row or column is NA,
sum returns 0 while mean returns NaN. The exclusion can be disabled using
the skipna option.
df=pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index = ['a','b','c','d'],columns=['one','two'])
df
Output=
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
df.sum()
Output=
one 9.25
two -5.80
dtype: float64
df.sum(axis='columns')
Output=
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
df.mean(axis='columns',skipna=False)
Output=
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
df.cumsum()
Output=
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
df.describe()
Output=
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
obj=pd.Series(['a','a','b','c']*4)
obj
Output=
0 a
1 a
2 b
3 c
4 a
5 a
6 b
7 c
8 a
9 a
10 b
11 c
12 a
13 a
14 b
15 c
dtype: object
obj.describe()
Output=count 16
unique 3
top a
freq 8
dtype: object
Data loading, storage, and file formats:
The tools and libraries for data analysis are of little use if one
can’t easily import and export data in Python. We will focus on
input and output with pandas objects, though there are of course
numerous tools in other libraries to aid in this process.
Input and output typically fall into a few main categories:
- Reading text files and other more efficient on-disk formats
- Loading data from databases
- Interacting with network sources like web APIs
Here is a txt file for practice; in the future you will also need to work with databases.
Download txt file (temp)
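If the download is not available, a stand-in file can be written by hand. The contents below are reconstructed from the outputs shown in this section, so treat them as an assumption: the real file may differ in details such as spacing or how the missing age is written.

```python
import pandas as pd

# Reconstructed from the outputs in this post (an assumption, not the
# original file). Vishal's age is left empty so it loads as NaN.
rows = """S.No,Name,Age,City,Salary,DOB
1,Vishal,,Nagpur,20000,22-12-1998
2,Pranay,32,Mumbai,3000,23-02-1991
3,Akshay,43,Banglore,8300,12-05-1985
4,Ram,38,Hyderabad,3900,01-12-1992
"""

# Write the stand-in file next to the script, then load it back.
with open("temp.txt", "w") as fh:
    fh.write(rows)

df = pd.read_csv("temp.txt")
print(df.shape)
```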
df=pd.read_csv("temp.txt")
df
Output=
S.No Name Age City Salary DOB
0 1 Vishal NaN Nagpur 20000 22-12-1998
1 2 Pranay 32.0 Mumbai 3000 23-02-1991
2 3 Akshay 43.0 Banglore 8300 12-05-1985
3 4 Ram 38.0 Hyderabad 3900 01-12-1992
print(df.shape)
Output=(4, 6)
df=pd.read_csv("temp.txt",usecols=["Name","Age"])
df
Output=
Name Age
0 Vishal NaN
1 Pranay 32.0
2 Akshay 43.0
3 Ram 38.0
df=pd.read_csv("temp.txt",index_col=['S.No'])
df
Output=
S.No Name Age City Salary DOB
1 Vishal NaN Nagpur 20000 22-12-1998
2 Pranay 32.0 Mumbai 3000 23-02-1991
3 Akshay 43.0 Banglore 8300 12-05-1985
4 Ram 38.0 Hyderabad 3900 01-12-1992
df.dtypes
Output=
Name object
Age float64
City object
Salary int64
DOB object
dtype: object
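date_cols=['DOB']
# The dates in the file are day-first (DD-MM-YYYY), so say so explicitly;
# otherwise ambiguous values such as 12-05-1985 may be read month-first.
df=pd.read_csv('temp.txt',parse_dates=date_cols,dayfirst=True)
df
Output=
S.No Name Age City Salary DOB
0 1 Vishal NaN Nagpur 20000 1998-12-22
1 2 Pranay 32.0 Mumbai 3000 1991-02-23
2 3 Akshay 43.0 Banglore 8300 1985-05-12
3 4 Ram 38.0 Hyderabad 3900 1992-12-01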
df['DOB'].dt.year
Output=0 1998
1 1991
2 1985
3 1992
Name: DOB, dtype: int64
df=pd.read_csv('temp.txt',names=['a','b','c','d','e','f'])
df
Output=
a b c d e f
0 S.No Name Age City Salary DOB
1 1 Vishal NaN Nagpur 20000 22-12-1998
2 2 Pranay 32 Mumbai 3000 23-02-1991
3 3 Akshay 43 Banglore 8300 12-05-1985
4 4 Ram 38 Hyderabad 3900 01-12-1992
df=pd.read_csv('temp.txt',names=['a','b','c','d','e','f'],header=0)
df
Output=
a b c d e f
0 1 Vishal NaN Nagpur 20000 22-12-1998
1 2 Pranay 32.0 Mumbai 3000 23-02-1991
2 3 Akshay 43.0 Banglore 8300 12-05-1985
3 4 Ram 38.0 Hyderabad 3900 01-12-1992
df=pd.read_csv('temp.txt',skiprows=2,names=['a','b','c','d','e','f'],header=0)
df
Output=
a b c d e f
0 3 Akshay 43 Banglore 8300 12-05-1985
1 4 Ram 38 Hyderabad 3900 01-12-1992
df=pd.read_csv("temp.txt")
df.loc[0,'Age']=21
df
Output=
S.No Name Age City Salary DOB
0 1 Vishal 21.0 Nagpur 20000 22-12-1998
1 2 Pranay 32.0 Mumbai 3000 23-02-1991
2 3 Akshay 43.0 Banglore 8300 12-05-1985
3 4 Ram 38.0 Hyderabad 3900 01-12-1992
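Since databases were mentioned as the next step, here is a rough sketch of a round trip through one. It assumes Python's built-in sqlite3 module with an in-memory database; the table name people and the two rows are made up for illustration.

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database, so nothing is written to disk.
con = sqlite3.connect(":memory:")

# Made-up rows in the same spirit as temp.txt.
people = pd.DataFrame({"Name": ["Pranay", "Akshay"], "Salary": [3000, 8300]})

# Store the frame as a table (index=False skips the row labels).
people.to_sql("people", con, index=False)

# Load the result of a SQL query straight into a DataFrame.
high = pd.read_sql_query(
    "SELECT Name, Salary FROM people WHERE Salary > 5000", con)
print(high)

con.close()
```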
You can apply the same operations to other files in the same way.
This is a vast topic, so try to understand it and practise
regularly.
msbtenotes:)