Pandas with Python

Hello everyone!

In this blog, we are going to learn about pandas.

So let's get started.

Course content:

Introduction to pandas

Data Structures

Series

DataFrame

Re-indexing

Operation between series and dataframe

Sorting and Ranking

Descriptive statistics

Data loading, storage, and file formats

Introduction to Pandas:

Pandas is an open-source, BSD-licensed Python library providing
high-performance, easy-to-use data structures and data analysis
tools for the Python programming language.

Python with pandas is used in a wide range of academic and commercial
domains, including finance, economics, statistics, and analytics.

Key features of pandas:

Fast and efficient DataFrame object with default and customized
indexing.

Tools for loading data into in-memory data objects from different file
formats.

Data alignment and integrated handling of missing data.

Reshaping and pivoting of data sets.

Label-based slicing, indexing, and subsetting of large data
sets.

Columns from a data structure can be deleted or inserted.

Group by data for aggregation and transformations.

High-performance merging and joining of data.
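
Two of the features listed above, grouping and merging, can be sketched in a few lines. This is only a minimal illustration: the sales and regions tables and all of their values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration.
sales = pd.DataFrame({
    "city": ["Nagpur", "Mumbai", "Nagpur", "Mumbai"],
    "amount": [100, 200, 50, 25],
})
regions = pd.DataFrame({
    "city": ["Nagpur", "Mumbai"],
    "state": ["Maharashtra", "Maharashtra"],
})

# Group the rows by city and aggregate the amounts.
totals = sales.groupby("city")["amount"].sum()
print(totals)

# Join the two tables on their shared "city" column.
merged = sales.merge(regions, on="city")
print(merged)
```

Here groupby() collects the rows for each city before summing, and merge() attaches the matching state to every sales row.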

Data Structures:

Pandas deals with the following data structures:

Series:- 1 Dimensional

DataFrame:- 2 Dimensional

Panel:- 3 Dimensional (removed from current versions of pandas)

These data structures are built on top of NumPy arrays, which means they
are fast.

The best way to think of these data structures is that the
higher-dimensional data structure is a container of its lower-dimensional
data structure.

For example, a DataFrame is a container of Series, and a Panel was a
container of DataFrames.
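
The "container" idea can be checked directly: picking one column out of a DataFrame gives back a Series. A small sketch, using made-up names and ages:

```python
import pandas as pd

# Hypothetical two-column table for illustration.
df = pd.DataFrame({"Name": ["Alok", "Anil"], "Age": [10, 30]})

# Selecting a single column returns a Series.
col = df["Age"]
print(type(col))

# The column shares the row index of the DataFrame it came from.
print(col.index.equals(df.index))
```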

Series:

Series is a 1-dimensional array-like structure capable of holding data
of any type (int, float, string, Python objects, etc.). All values in a
Series share a single dtype.

The axis labels are collectively called the index.

Key points: Homogeneous data, Size immutable, Values of data
mutable.

A pandas Series can be created using the constructor:

pandas.Series(data, index, dtype, copy)

  
import pandas as pd
import numpy as np
s=pd.Series(dtype=float)
print(s)
print(type(s))
Output=Series([], dtype: float64)
<class 'pandas.core.series.Series'>
data=np.array(['a','b','c','d'])
print(data)
Output=['a' 'b' 'c' 'd']
s=pd.Series(data)
s
Output=
0    a
1    b
2    c
3    d
dtype: object
s=pd.Series(data,index=[111,222,333,444])
s
Output=
111    a
222    b
333    c
444    d
dtype: object

data={'a':0,'b':1,'c':2}
print(data)
Output={'a': 0, 'b': 1, 'c': 2}
s=pd.Series(data)
s
Output=
a    0
b    1
c    2
dtype: int64
s=pd.Series(data,index=['b','c','d','a'])
s
Output=
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
s=pd.Series(5,index=[0,1,2,3])
s
Output=
0    5
1    5
2    5
3    5
dtype: int64
s=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
s
Output=
a    1
b    2
c    3
d    4
e    5
dtype: int64
print(s.iloc[0])  # positional access; s[0] on a labeled index is deprecated in recent pandas
print(s[:3])
print(s[-3:])
print(s['a'])
print(s[['a','c','d']])
Output=
1

a    1
b    2
c    3
dtype: int64

c    3
d    4
e    5
dtype: int64

1

a    1
c    3
d    4
dtype: int64
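
When a Series has string labels, it helps to say explicitly whether you mean a position or a label. A minimal sketch using .iloc (position) and .loc (label) on the same kind of Series as above; the variable names are only for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

first_by_position = s.iloc[0]    # element at position 0
first_by_label = s.loc["a"]      # element with label "a"
last_three = s.iloc[-3:]         # last three elements by position
by_label_slice = s.loc["a":"c"]  # label slices include the end label

print(first_by_position, first_by_label)
print(last_three)
print(by_label_slice)
```

Note that a label slice such as "a":"c" includes the end label, unlike a positional slice.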

Data Operations:

s=pd.Series(np.random.randn(8))
print(s)
Output=
0   -0.263335
1    0.598104
2    2.115186
3    0.163075
4   -0.529759
5    1.268830
6    0.215765
7    1.313002
dtype: float64
s.axes
Output=[RangeIndex(start=0, stop=8, step=1)]
s2=pd.Series(np.random.randn(4),index=[11,12,13,14])
s2
Output=
11    0.661457
12    0.112885
13   -0.334969
14    0.350130
dtype: float64
s2.axes
Output=[Int64Index([11, 12, 13, 14], dtype='int64')]
s.dtypes
Output=dtype('float64')
s.ndim
Output=1
s.size
Output=8
se=pd.Series(dtype=float)
print(se)
Output=Series([], dtype: float64)
se.empty
Output=True
s.empty
Output=False
s.values
Output=array([-0.26333516,  0.59810371,  2.11518637,  0.16307502, -0.52975919,
        1.26882994,  0.21576517,  1.31300243])
s.head(2)
Output=
0   -0.263335
1    0.598104
dtype: float64

s.tail(2)
Output=
6    0.215765
7    1.313002
dtype: float64

s.head()
Output=
0   -0.263335
1    0.598104
2    2.115186
3    0.163075
4   -0.529759
dtype: float64

s.tail()
Output=
3    0.163075
4   -0.529759
5    1.268830
6    0.215765
7    1.313002
dtype: float64

DataFrame:

A DataFrame is a two-dimensional data structure, i.e., data is aligned
in a tabular fashion in rows and columns.

Features of DataFrame:

Columns can be of different types

Size is mutable

Labeled axes (rows and columns)

Can perform arithmetic operations on rows and columns.

A pandas DataFrame can be created using the constructor:

pandas.DataFrame(data, index, columns, dtype, copy)

It accepts various inputs, such as:

Lists

Dictionary

Series

NumPy ndarrays

Another DataFrame


df=pd.DataFrame()
print(df)
Output=
Empty DataFrame
Columns: []
Index: []
data=[11,22,33,44,55]
df=pd.DataFrame(data)
print(df)
Output=
    0
0  11
1  22
2  33
3  44
4  55
data=[["Alok",10],["Bhushan",20],["Anil",30]]
df=pd.DataFrame(data,columns=["Name","age"])
print(df)
Output=
      Name  age
0     Alok   10
1  Bhushan   20
2     Anil   30
data=[["Alok",10],["Bhushan",20],["Anil",30]]
df=pd.DataFrame(data,columns=["Name","age"],dtype=float)
print(df)
Output=
      Name   age
0     Alok  10.0
1  Bhushan  20.0
2     Anil  30.0
data={"Name":["Swapnil","Anil","Anup","Viraj"],
     "Age":[20,21,22,24]}
print(data)
Output={'Name': ['Swapnil', 'Anil', 'Anup', 'Viraj'], 'Age': [20, 21, 22, 24]}
df=pd.DataFrame(data)
print(df)
Output=
      Name  Age
0  Swapnil   20
1     Anil   21
2     Anup   22
3    Viraj   24

d={'Names':pd.Series(["Praanay","Prem","Atul","Amar","Sarthak"]),"Age":pd.Series([20,25,21,22,23]),"Rating":pd.Series([2.2,2.3,5.3,1.6,4.5])}
df=pd.DataFrame(d,columns=['Names',"Age","Rating"])
print(df)
Output=
     Names  Age  Rating
0  Praanay   20     2.2
1     Prem   25     2.3
2     Atul   21     5.3
3     Amar   22     1.6
4  Sarthak   23     4.5
print(df.T)
Output=
              0     1     2     3        4
Names   Praanay  Prem  Atul  Amar  Sarthak
Age          20    25    21    22       23
Rating      2.2   2.3   5.3   1.6      4.5
print(df.axes)
Output=[RangeIndex(start=0, stop=5, step=1), Index(['Names', 'Age', 'Rating'], dtype='object')]
print(df.ndim)
Output=2
print(df.shape)
Output=(5, 3)
print(df.size)
Output=15
print(df.values)
Output=
[['Praanay' 20 2.2]
 ['Prem' 25 2.3]
 ['Atul' 21 5.3]
 ['Amar' 22 1.6]
 ['Sarthak' 23 4.5]]

data=pd.DataFrame(np.arange(16).reshape(4,4),index=["Indore","Raipur","Nagpur","Hyderabad"],columns=['one','two','three','four'])
print(data)
Output=
           one  two  three  four
Indore       0    1      2     3
Raipur       4    5      6     7
Nagpur       8    9     10    11
Hyderabad   12   13     14    15
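
Since the size of a DataFrame is mutable, columns can be inserted and deleted after it is created. A minimal sketch, assuming a small made-up table; the "Senior" and "City" columns are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Swapnil", "Anil", "Anup"], "Age": [20, 21, 22]})

# Add a new column at the end by assignment.
df["Senior"] = df["Age"] > 20

# Insert a column at a specific position (here, position 1).
df.insert(1, "City", ["Nagpur", "Raipur", "Indore"])

# Delete a column in place.
del df["Senior"]

# drop() returns a new DataFrame and leaves df unchanged.
trimmed = df.drop(columns=["City"])

print(df)
print(trimmed)
```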

Re-Indexing:

A critical method on pandas objects is reindex(), which creates a new
object with the data conformed to a new index.

For ordered data like time series, it may be desirable to do some
interpolation or filling of values when reindexing.

The method option allows us to do this, using a method such as ffill,
which forward-fills the values.

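# Assumed definition (omitted in the original): only the index (a, b, c) and the 'Raipur' values (2, 5, 8) are known from the output; 'Nagpur' and 'Mumbai' are placeholder column names.
frame=pd.DataFrame(np.arange(9).reshape(3,3),index=['a','b','c'],columns=['Nagpur','Mumbai','Raipur'])
states=["Raipur","Indore","Hyderabad"]
frame.reindex(columns=states)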
Output=
   Raipur  Indore  Hyderabad
a       2     NaN        NaN
b       5     NaN        NaN
c       8     NaN        NaN
frame.reindex(index=['a','b','c','d'],columns=states)
Output=
   Raipur  Indore  Hyderabad
a     2.0     NaN        NaN
b     5.0     NaN        NaN
c     8.0     NaN        NaN
d     NaN     NaN        NaN
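
The forward-fill option mentioned above is easiest to see on a Series with gaps in its index. A minimal sketch; the colour values are made up for the example:

```python
import pandas as pd

# A Series that only has values at positions 0, 2 and 4.
obj = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Reindex to 0..5 and carry the last known value forward into the gaps.
filled = obj.reindex(range(6), method="ffill")
print(filled)
```

Each new label (1, 3 and 5) takes the value of the nearest earlier label, so no NaN values appear.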

Drop Command:

With the dataframe, index values can be deleted from either axis.

Dropping one or more entries from an axis is easy if one has an index
array or list without those entries. As that can require a bit of set
logic, the drop method will return a new object with the indicated value
or values deleted from an axis.

obj=pd.Series(np.arange(5),index=['a','b','c','d','e'])
print(obj)
Output=
a    0
b    1
c    2
d    3
e    4
dtype: int32
new_obj=obj.drop('c')
new_obj
Output=
a    0
b    1
d    3
e    4
dtype: int32
new_obj=obj.drop(['c','d'])
new_obj
Output=
a    0
b    1
e    4
dtype: int32
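
The same drop method works on a DataFrame, where labels can be removed from either axis. A short sketch, reusing the city/number table from the DataFrame section:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Indore", "Raipur", "Nagpur", "Hyderabad"],
    columns=["one", "two", "three", "four"],
)

# Drop rows by label (axis 0 is the default).
rows_dropped = data.drop(["Raipur", "Nagpur"])

# Drop a column by passing axis=1 (or axis="columns").
cols_dropped = data.drop("two", axis=1)

print(rows_dropped)
print(cols_dropped)
```

In both cases drop returns a new object, so data itself still has all four rows and columns.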

Arithmetic and data alignment:

One of the most important pandas features is the behavior of arithmetic
between objects with different indexes.

When adding objects together, if any index pairs are not the same, the
index of the result will be the union of the two indexes.

The internal data alignment introduces NaN values at the labels that
don't overlap.

In the case of DataFrames, alignment is performed on both the rows and
the columns, which returns a DataFrame whose index and columns are the
unions of the ones in each DataFrame.

Relatedly, when reindexing a Series or DataFrame, one can also specify
a different fill value.
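
A small sketch of alignment, using two made-up Series whose indexes only partly overlap. It also shows the fill_value option, both in the add() method and in reindex():

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Plain addition aligns on labels; "a" and "d" exist on one side only, so they become NaN.
total = s1 + s2
print(total)

# add() with fill_value treats a label missing on one side as 0 instead.
filled = s1.add(s2, fill_value=0)
print(filled)

# reindex() also accepts a fill value for labels that were not present.
reindexed = s1.reindex(["a", "b", "c", "d"], fill_value=0)
print(reindexed)
```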

Operations between Dataframe & Series:

As with NumPy arrays, arithmetic between a DataFrame and a Series is
well defined.

By default, arithmetic between a DataFrame and a Series matches the
index of the Series on the DataFrame's columns, broadcasting down the
rows.

If an index value is not found in either the DataFrame's columns or the
Series' index, the objects will be reindexed to form the union.

If one wants to instead broadcast over the columns, matching on the
rows, one has to use one of the arithmetic methods (such as sub()) and
pass the axis to match on.

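frame=pd.DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),index=['Raipur','Nagpur','hyderabad','indore'])  # definition reconstructed from the printed frame
series2=pd.Series(range(3),index=['b','e','f'])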
series2
Output=
b    0
e    1
f    2
dtype: int64
print(frame)
Output=
           b   d   e
Raipur     0   1   2
Nagpur     3   4   5
hyderabad  6   7   8
indore     9  10  11
frame+series2
Output=
             b   d     e   f
Raipur     0.0 NaN   3.0 NaN
Nagpur     3.0 NaN   6.0 NaN
hyderabad  6.0 NaN   9.0 NaN
indore     9.0 NaN  12.0 NaN
series3=frame['d']
series3
Output=
Raipur        1
Nagpur        4
hyderabad     7
indore       10
Name: d, dtype: int32
frame
Output=
           b   d   e
Raipur     0   1   2
Nagpur     3   4   5
hyderabad  6   7   8
indore     9  10  11
frame.sub(series3,axis="index")
Output=
           b  d  e
Raipur    -1  0  1
Nagpur    -1  0  1
hyderabad -1  0  1
indore    -1  0  1

Function application and mapping:

NumPy ufuncs (element-wise array methods) work fine with pandas
objects.

Another frequent operation is applying a function on 1D arrays to each
column or row. DataFrame’s apply method does exactly this.

Many of the most common array statistics (like sum and mean) are
DataFrame methods, so using apply is not necessary.

The function passed to apply need not return a scalar value; it can
also return a Series with multiple values.

frame=pd.DataFrame(np.random.randn(4,3),columns=list('bde'),index=['raipur','nagpur','hyderabad','indore'])
frame
Output=
                  b         d         e
raipur     1.977048 -1.860493  0.768591
nagpur    -1.498661 -2.329090  0.222861
hyderabad  0.110777 -0.467806 -0.943308
indore    -0.033976 -0.147853  0.157741

np.abs
Output=<ufunc 'absolute'>

f=lambda x:x.max()-x.min()
frame.apply(f)
Output=
b    3.475709
d    2.181237
e    1.711899
dtype: float64

frame.apply(f,axis='columns')
Output=
raipur       3.837542
nagpur       2.551952
hyderabad    1.054084
indore       0.305594
dtype: float64

def f(x):
    return pd.Series([x.min(),x.max()],index=['min','max'])
frame.apply(f)
Output=
            b         d         e
min -1.498661 -2.329090 -0.943308
max  1.977048 -0.147853  0.768591
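
To see a NumPy ufunc actually applied to a pandas object, here is a minimal sketch with fixed (made-up) numbers instead of random ones, so the result is easy to check by eye. The last step uses Series.map for element-wise formatting, which is one possible approach and is not taken from the text above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.5, -2.0], [-0.5, 3.0]],
    columns=["b", "d"],
    index=["raipur", "nagpur"],
)

# A ufunc is applied element-wise and returns a DataFrame with the same labels.
absolute = np.abs(frame)
print(absolute)

# Element-wise Python functions can be applied to a single column with map().
formatted = frame["b"].map(lambda x: f"{x:.2f}")
print(formatted)
```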

Sorting and Ranking:

Sorting a data set by some criterion is another important built-in
operation. To sort lexicographically by row or column index, use the
sort_index() method, which returns a new, sorted object.

With a DataFrame, one can sort by index on either axis. The data is
sorted in ascending order by default but can be sorted in descending
order too, by passing ascending=False. To sort by the values in one or
more columns, use sort_values().

Ranking assigns ranks from one through the number of valid data points.
The rank() methods for Series and DataFrame are the place to look; by
default, rank() breaks ties by assigning each tied group the mean rank.

Ranks can also be assigned according to the order in which values are
observed in the data, using method='first'.

Naturally, one can rank in descending order, too.

obj=pd.Series(range(4),index=['d','a','b','c'])
obj
Output=
d    0
a    1
b    2
c    3
dtype: int64
obj.sort_index()
Output=
a    1
b    2
c    3
d    0
dtype: int64

frame=pd.DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],columns=['d','a','b','c'])
print(frame)
Output=
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

frame.sort_index()
Output=
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

frame.sort_index(axis=1)
Output=
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

frame=pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
frame
Output=
   b  a
0  4  0
1  7  1
2 -3  0
3  2  1

frame.sort_values(by='b')
Output=
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1
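
The ranking behaviour described in this section has no example above, so here is a minimal sketch on a made-up Series that contains ties. It shows the default mean rank, ranking by order of appearance, and descending order:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default: tied values share the mean of the ranks they would occupy.
mean_rank = obj.rank()

# method="first": ties are ranked in the order they appear in the data.
first_rank = obj.rank(method="first")

# ascending=False: the largest value gets rank 1.
desc_rank = obj.rank(ascending=False)

# Sorting by value in descending order.
sorted_desc = obj.sort_values(ascending=False)

print(mean_rank)
print(first_rank)
print(desc_rank)
print(sorted_desc)
```

For instance, the two 7s would occupy ranks 6 and 7, so with the default method each gets 6.5.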

Axis indexes with duplicate values:

Up until now, all of the examples we have seen had unique axis labels
(index values). While many pandas functions (like reindex()) require that
the labels be unique, it's not mandatory.

The index's is_unique property can tell you whether its values are
unique or not.

Data selection is one of the main things that behaves differently with
duplicates. Indexing a label with multiple entries returns a Series, while
single entries return a scalar value.

obj=pd.Series(range(5),index=['a','a','b','b','c'])
obj
Output=
a    0
a    1
b    2
b    3
c    4
dtype: int64
obj.index.is_unique
Output=False
obj['a']
Output=
a    0
a    1
dtype: int64
obj['c']
Output=4
df=pd.DataFrame(np.random.randn(4,3),index=['a','a','b','c'])
df
Output=
          0         1         2
a -0.948989 -0.236842  1.203461
a -1.186551  0.934325 -1.282523
b  0.679511 -1.089725  1.387880
c  0.743163 -0.895804  0.361094

df.loc['b']
Output=
0    0.679511
1   -1.089725
2    1.387880
Name: b, dtype: float64

df.loc['a']
Output=
          0         1         2
a -0.948989 -0.236842  1.203461
a -1.186551  0.934325 -1.282523
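
To see why unique labels matter for functions like reindex(), here is a small sketch. It reindexes a Series with duplicate labels, catches the error, and then removes the duplicates. Keeping the first entry per label is only one possible choice, picked here for illustration:

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

# reindex() needs unique labels, so this raises a ValueError.
try:
    obj.reindex(["a", "b", "c", "d"])
    raised = False
except ValueError as exc:
    raised = True
    print("reindex failed:", exc)

# One way to get unique labels: keep only the first entry for each label.
deduped = obj[~obj.index.duplicated(keep="first")]
print(deduped)
print(deduped.index.is_unique)
```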

Descriptive statistics with pandas:

Pandas objects are equipped with a set of common mathematical and
statistical methods. Most of these fall into the category of reductions
or summary statistics, methods that extract a single value (like the sum
or mean) from a Series or a Series of values from the rows or columns of
a DataFrame. Compared with the equivalent methods of NumPy arrays, they
are all built from the ground up to exclude missing data.

NA values are excluded unless the entire slice is NA. This can be
disabled using the skipna option.

df=pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index = ['a','b','c','d'],columns=['one','two'])
df
Output=
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

df.sum()
Output=
one    9.25
two   -5.80
dtype: float64

df.sum(axis='columns')
Output=
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

df.mean(axis='columns',skipna=False)
Output=
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

df.cumsum()
Output=
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8

df.describe()
Output=
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000

obj=pd.Series(['a','a','b','c']*4)
obj
Output=
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

obj.describe()
Output=
count     16
unique     3
top        a
freq       8
dtype: object

Data loading, storage, and file formats:

The tools and libraries for data analysis are of little use if one
can't easily import and export data in Python. We will focus on
input and output with pandas objects, though there are of course
numerous tools in other libraries to aid in this process.

Input and output typically fall into a few main categories:

Reading text files and other more efficient on-disk formats

Loading data from databases

Interacting with network sources like web APIs.

The examples below use a small practice text file, temp.txt. In the future, you will also have to work with databases.

df=pd.read_csv("temp.txt")
df
Output=
   S.No    Name   Age       City  Salary         DOB
0     1  Vishal   NaN     Nagpur   20000  22-12-1998
1     2  Pranay  32.0     Mumbai    3000  23-02-1991
2     3  Akshay  43.0   Banglore    8300  12-05-1985
3     4     Ram  38.0  Hyderabad    3900  01-12-1992

print(df.shape)
Output=(4, 6)
df=pd.read_csv("temp.txt",usecols=["Name","Age"])
df
Output=
     Name   Age
0  Vishal   NaN
1  Pranay  32.0
2  Akshay  43.0
3     Ram  38.0

df=pd.read_csv("temp.txt",index_col=['S.No'])
df
Output=
        Name   Age       City  Salary         DOB
S.No
1     Vishal   NaN     Nagpur   20000  22-12-1998
2     Pranay  32.0     Mumbai    3000  23-02-1991
3     Akshay  43.0   Banglore    8300  12-05-1985
4        Ram  38.0  Hyderabad    3900  01-12-1992

df.dtypes
Output=
Name       object
Age       float64
City       object
Salary      int64
DOB        object
dtype: object

date_cols=['DOB']
df=pd.read_csv('temp.txt',parse_dates=date_cols,dayfirst=True)  # the dates in the file are day-first (DD-MM-YYYY)
df
Output=
   S.No    Name   Age       City  Salary        DOB
0     1  Vishal   NaN     Nagpur   20000 1998-12-22
1     2  Pranay  32.0     Mumbai    3000 1991-02-23
2     3  Akshay  43.0   Banglore    8300 1985-05-12
3     4     Ram  38.0  Hyderabad    3900 1992-12-01

df['DOB'].dt.year
Output=
0    1998
1    1991
2    1985
3    1992
Name: DOB, dtype: int64

df=pd.read_csv('temp.txt',names=['a','b','c','d','e','f'])
df
Output=
      a       b    c          d       e           f
0  S.No    Name  Age       City  Salary         DOB
1     1  Vishal  NaN     Nagpur   20000  22-12-1998
2     2  Pranay   32     Mumbai    3000  23-02-1991
3     3  Akshay   43   Banglore    8300  12-05-1985
4     4     Ram   38  Hyderabad    3900  01-12-1992


df=pd.read_csv('temp.txt',names=['a','b','c','d','e','f'],header=0)
df
Output=
   a       b     c          d      e           f
0  1  Vishal   NaN     Nagpur  20000  22-12-1998
1  2  Pranay  32.0     Mumbai   3000  23-02-1991
2  3  Akshay  43.0   Banglore   8300  12-05-1985
3  4     Ram  38.0  Hyderabad   3900  01-12-1992

df=pd.read_csv('temp.txt',skiprows=2,names=['a','b','c','d','e','f'],header=0)
df
Output=
   a       b   c          d     e           f
0  3  Akshay  43   Banglore  8300  12-05-1985
1  4     Ram  38  Hyderabad  3900  01-12-1992

df=pd.read_csv("temp.txt")
df.loc[0,'Age']=21
df
Output=
   S.No    Name   Age       City  Salary         DOB
0     1  Vishal  21.0     Nagpur   20000  22-12-1998
1     2  Pranay  32.0     Mumbai    3000  23-02-1991
2     3  Akshay  43.0   Banglore    8300  12-05-1985
3     4     Ram  38.0  Hyderabad    3900  01-12-1992
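
The examples above only read data, so here is a hedged sketch of the storage side as well. To keep it runnable without the practice file, it assumes temp.txt is comma-separated with the contents reconstructed from the outputs above, and it reads and writes through in-memory buffers (io.StringIO) instead of real files:

```python
import io
import pandas as pd

# Assumed contents of temp.txt, reconstructed from the printed outputs;
# the comma separator is an assumption.
text = """S.No,Name,Age,City,Salary,DOB
1,Vishal,,Nagpur,20000,22-12-1998
2,Pranay,32,Mumbai,3000,23-02-1991
3,Akshay,43,Banglore,8300,12-05-1985
4,Ram,38,Hyderabad,3900,01-12-1992
"""

# Read the text as if it were a file, using S.No as the row index.
df = pd.read_csv(io.StringIO(text), index_col="S.No")

# Fill in the missing age for S.No 1.
df.loc[1, "Age"] = 21

# Write the result back out as CSV text.
buffer = io.StringIO()
df.to_csv(buffer)
print(buffer.getvalue())

# Reading the written text again gives back the same table.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), index_col="S.No")
print(round_trip)
```

With a real file you would pass a path such as "temp.txt" to read_csv() and an output path to to_csv() instead of the buffers.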

You can apply the same operations to other files in this way.

This is a vast topic; try to understand it and practise
regularly.

Best regards from,

msbtenotes:)

THANK YOU!!!
