Pandas and Numpy in PYTHON

Published
10 min read

I am a versatile full-stack developer with expertise in both modern and traditional web technologies. My skill set encompasses the MERN (MongoDB, Express.js, React.js, Node.js) stack, enabling me to build scalable and efficient web applications with ease. Additionally, I have extensive experience in PHP, allowing me to tackle a wide range of projects and integrate legacy systems seamlessly. With a passion for problem-solving and a keen eye for detail, I strive to deliver high-quality solutions that exceed expectations. My dedication to staying updated with the latest industry trends and best practices ensures that my work is always cutting-edge and future-proof.

  1. Mastering Pandas: The Ultimate Beginner’s Guide to Data Handling in Python

Data is everywhere — but raw data is messy, inconsistent, and rarely ready for analysis.
This is where Pandas comes to the rescue.

Pandas is a Python library for data manipulation and analysis, built on top of NumPy.
It gives you powerful tools to clean, explore, and transform datasets efficiently — whether they’re in CSV files, SQL tables, JSON APIs, or Excel sheets.


🔹 What is Pandas?

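The name Pandas derives from “panel data”, an econometrics term for structured, multidimensional datasets, and is also read as a play on “Python data analysis”.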
It provides easy-to-use data structures — mainly Series and DataFrame — to work with structured data.

Think of a Series as a single column in Excel, and a DataFrame as a full spreadsheet with rows and columns.


1. Installation & Import

pip install pandas        # run this in a terminal
import pandas as pd       # run this in Python

2. Pandas Data Structures

🔹 Series

  • 1-dimensional labeled array.

  • Can hold any data type.

s = pd.Series([10, 20, 30, 40])
print(s)

🔹 DataFrame

  • 2-dimensional labeled data structure (like a spreadsheet or SQL table).

  • Can hold heterogeneous data types.

data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)
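
To connect the two structures, here is a small sketch (reusing the Name/Age sample data from above) showing that picking a single column out of a DataFrame gives you a Series that shares the same row labels.

```python
import pandas as pd

# Sample data mirroring the Name/Age example above
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

ages = df['Age']                    # selecting one column returns a Series
print(type(df).__name__)            # DataFrame
print(type(ages).__name__)          # Series

# The Series keeps the DataFrame's row labels (its index)
print(ages.index.equals(df.index))  # True
print(ages.tolist())                # [25, 30]
```

In other words, a DataFrame behaves like a collection of Series that all share one index.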


3. Creating DataFrames

  1. From dictionary
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
  2. From list of lists
data = [['Alice', 25], ['Bob', 30]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
  3. From CSV/Excel
df = pd.read_csv('data.csv')
df = pd.read_excel('data.xlsx')   # .xlsx files need the openpyxl package installed

4. Viewing & Inspecting Data

df.head()       # First 5 rows
df.tail()       # Last 5 rows
df.shape        # returns a tuple like (number_of_rows, number_of_columns)
df.info()       # Summary of dataframe
df.describe()   # Statistical summary for numeric columns
df.columns      # Column names
df.index        # Row labels


5. Selecting Data

🔹 Selecting Columns

df['Name']
df[['Name', 'Age']]

🔹 Selecting Rows

When working with DataFrames, it’s common to select specific rows. Pandas provides two main ways: iloc and loc.

iloc – Index-based Selection

  • iloc stands for integer-location based selection.

  • You use it when you want to select rows by their integer position (row number).

  • Syntax: df.iloc[row_index] or df.iloc[start:end] (the end position is excluded)

loc – Label-based Selection

  • loc stands for label-based selection.

  • You use it when you want to select rows by their index label.

  • Syntax: df.loc[row_label] or df.loc[start_label:end_label] (the end label is included)

df.iloc[0]      # First row by integer position
df.loc[0]       # Row whose index label is 0 (the first row only with the default index)
df.iloc[0:3]    # First three rows
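
With the default 0, 1, 2... index, iloc and loc can look identical. The sketch below uses a hypothetical DataFrame with string labels as its index, so the difference between position and label (and between the two slicing rules) becomes visible.

```python
import pandas as pd

# Hypothetical data: names are used as row labels instead of 0, 1, 2, ...
df = pd.DataFrame(
    {'Age': [25, 30, 28, 22]},
    index=['alice', 'bob', 'charlie', 'david'],
)

by_position = df.iloc[0:2]            # positions 0 and 1; the end is excluded
by_label = df.loc['alice':'charlie']  # labels alice to charlie; the end is included

print(by_position.index.tolist())     # ['alice', 'bob']
print(by_label.index.tolist())        # ['alice', 'bob', 'charlie']
print(df.loc['bob', 'Age'])           # 30
print(df.iloc[1, 0])                  # 30 (same cell, addressed by position)
```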

🔹 Conditional Selection

df[df['Age'] > 25]
df[(df['Age'] > 20) & (df['Name'] == 'Alice')]
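
Conditions are combined with & (and), | (or) and ~ (not), and each condition needs its own parentheses. Here is a minimal sketch on made-up data; the column values are only for illustration.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 19]})

both = df[(df['Age'] > 20) & (df['Name'] != 'Bob')]     # AND
either = df[(df['Age'] < 20) | (df['Name'] == 'Bob')]   # OR
not_alice = df[~(df['Name'] == 'Alice')]                # NOT
in_list = df[df['Name'].isin(['Alice', 'Charlie'])]     # membership test

print(both['Name'].tolist())       # ['Alice']
print(either['Name'].tolist())     # ['Bob', 'Charlie']
print(not_alice['Name'].tolist())  # ['Bob', 'Charlie']
print(in_list['Name'].tolist())    # ['Alice', 'Charlie']
```

Python's plain `and` / `or` keywords do not work here, because they try to reduce a whole column to a single True or False.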

6. Modifying Data

  • Add new column
df['Salary'] = [50000, 60000]
  • Modify existing column
df['Age'] = df['Age'] + 1
  • Rename columns
df.rename(columns={'Age':'Years'}, inplace=True)
  • Drop column/row
df.drop('Salary', axis=1, inplace=True)  # Column
df.drop(0, axis=0, inplace=True)         # Row

7. Handling Missing Data

  • Check for missing values
df.isnull()
df.isnull().sum()
  • Fill missing values
df.fillna(0, inplace=True)
df['Column'] = df['Column'].fillna(df['Column'].mean())  # assign back; inplace on a selected column is unreliable in recent pandas
  • Drop missing values
df.dropna(inplace=True)
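
To see these calls on actual data, here is a small sketch with one missing value. The Score column and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [80.0, np.nan, 90.0],
})

print(df['Score'].isnull().sum())    # 1

# Fill the gap with the column mean (mean() skips NaN, so it is 85.0)
filled = df.assign(Score=df['Score'].fillna(df['Score'].mean()))
print(filled['Score'].tolist())      # [80.0, 85.0, 90.0]

# Or drop the incomplete row instead
dropped = df.dropna()
print(len(dropped))                  # 2
print(len(df))                       # 3 (the original is unchanged without inplace=True)
```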

8. Data Cleaning

  • Remove duplicates:
df.drop_duplicates(inplace=True)
  • Strip whitespace:
df['Name'] = df['Name'].str.strip()
  • Change case:
df['Name'] = df['Name'].str.upper()

9. Sorting Data

  • By column:
df.sort_values('Age', ascending=True, inplace=True)
  • By index:
df.sort_index(inplace=True)

10. Aggregation & Grouping

  • Basic statistics
df['Age'].sum()
df['Age'].mean()
df['Age'].max()
df['Age'].min()
df['Age'].std()
  • Group by
df.groupby('Department')['Salary'].mean()
  • Multiple aggregations
df.groupby('Department')['Salary'].agg(['mean', 'sum', 'max'])
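
The groupby calls above assume a DataFrame with Department and Salary columns. A minimal sketch with made-up values shows what the result looks like.

```python
import pandas as pd

# Hypothetical data; department names and salaries are illustrative only
df = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'IT'],
    'Salary': [40000, 50000, 60000, 70000, 80000],
})

# One row per department, one column per aggregation
summary = df.groupby('Department')['Salary'].agg(['mean', 'sum', 'max'])
print(summary)

print(summary.loc['HR', 'mean'])   # 45000.0
print(summary.loc['IT', 'sum'])    # 210000
print(summary.loc['IT', 'max'])    # 80000
```

The group keys (HR, IT) become the index of the result, so individual values can be looked up with loc.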

11. Merging, Joining & Concatenation

  • Concatenate
pd.concat([df1, df2], axis=0)  # Stack vertically (append rows)
pd.concat([df1, df2], axis=1)  # Place side by side (add columns, aligned on the index)
  • Merge / Join
pd.merge(df1, df2, on='Key', how='inner')  # inner, left, right, outer
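
The how argument decides which keys survive the merge. The sketch below uses two small made-up tables whose keys only partly overlap, so the join types give different results.

```python
import pandas as pd

# Hypothetical tables: key 1 exists only on the left, key 4 only on the right
left = pd.DataFrame({'Key': [1, 2, 3], 'Name': ['A', 'B', 'C']})
right = pd.DataFrame({'Key': [2, 3, 4], 'Salary': [60000, 55000, 70000]})

inner = pd.merge(left, right, on='Key', how='inner')      # keys present in both
left_join = pd.merge(left, right, on='Key', how='left')   # all keys from the left table
outer = pd.merge(left, right, on='Key', how='outer')      # keys from either table

print(inner['Key'].tolist())                # [2, 3]
print(left_join['Key'].tolist())            # [1, 2, 3]
print(sorted(outer['Key'].tolist()))        # [1, 2, 3, 4]
print(left_join['Salary'].isnull().sum())   # 1 (key 1 has no matching salary)
```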

12. Applying Functions

  • Using apply()
df['Age_plus_5'] = df['Age'].apply(lambda x: x + 5)
  • Vectorized operations
df['Salary'] = df['Salary'] * 1.1
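
For simple arithmetic, apply() and a vectorized expression give the same answer, and the vectorized form is usually the faster one. A quick sketch with made-up ages; np.where is shown as one possible vectorized alternative to an if/else lambda.

```python
import numpy as np
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'Age': [15, 22, 35]})

with_apply = df['Age'].apply(lambda x: x + 5)   # calls the lambda once per row
vectorized = df['Age'] + 5                      # one operation on the whole column

print(with_apply.tolist())   # [20, 27, 40]
print(vectorized.tolist())   # [20, 27, 40]

# Vectorized if/else: pick 'Adult' where the condition holds, else 'Minor'
df['AgeGroup'] = np.where(df['Age'] >= 18, 'Adult', 'Minor')
print(df['AgeGroup'].tolist())   # ['Minor', 'Adult', 'Adult']
```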

13. Working with Dates

df['JoinDate'] = pd.to_datetime(df['JoinDate'])
df['Year'] = df['JoinDate'].dt.year
df['Month'] = df['JoinDate'].dt.month
df['Day'] = df['JoinDate'].dt.day

14. Pivot Tables

df.pivot_table(values='Salary', index='Department', columns='Gender', aggfunc='mean')
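
The call above assumes Department, Gender and Salary columns. Here is a minimal sketch with made-up rows to show the shape of the result: one row per Department, one column per Gender.

```python
import pandas as pd

# Hypothetical data; values are illustrative only
df = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT'],
    'Gender': ['F', 'M', 'F', 'M'],
    'Salary': [40000, 50000, 60000, 70000],
})

table = df.pivot_table(values='Salary', index='Department',
                       columns='Gender', aggfunc='mean')
print(table)

print(table.loc['HR', 'F'])   # 40000.0
print(table.loc['IT', 'M'])   # 70000.0
```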

15. Exporting Data

df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)

16. Visualization with Pandas

# Plotting relies on matplotlib, which must be installed separately
df['Salary'].plot(kind='hist')        # Histogram
df.plot(x='Age', y='Salary', kind='scatter')  # Scatter plot
df['Department'].value_counts().plot(kind='bar')  # Bar plot

17. Tips for Beginners

  • Start with small datasets to understand operations.

  • Use head() and tail() often to inspect data.

  • Chain operations carefully: df.dropna().groupby('Dept')['Salary'].mean().

  • Remember Pandas is built on NumPy, so vectorized operations are faster than loops.
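
The chaining tip can be tried on a tiny made-up table. Wrapping the chain in parentheses lets you put one step per line, which makes it easier to read and to debug step by step.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing salary
df = pd.DataFrame({
    'Dept': ['HR', 'HR', 'IT', 'IT'],
    'Salary': [40000, np.nan, 60000, 80000],
})

result = (
    df.dropna()                  # 1. remove incomplete rows
      .groupby('Dept')['Salary'] # 2. group the remaining salaries by department
      .mean()                    # 3. average each group
)

print(result['HR'])   # 40000.0
print(result['IT'])   # 70000.0
```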


Conclusion

Pandas is an essential tool for data cleaning, transformation, and analysis in Python.
Once you master it, tasks that used to take hours in Excel or SQL can be done in a few lines of code.

# 🐼 PANDAS COMPLETE PRACTICE CODE FOR BEGINNERS
# ----------------------------------------------

# 1️⃣ Importing Pandas
import pandas as pd

# 2️⃣ Creating Series
s = pd.Series([10, 20, 30, 40])
print("Series:\n", s, "\n")
# Output:
# 0    10
# 1    20
# 2    30
# 3    40
# dtype: int64

# 3️⃣ Creating DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 28, 22],
    'City': ['Delhi', 'Mumbai', 'Bangalore', 'Chennai']
}
df = pd.DataFrame(data)
print("DataFrame:\n", df, "\n")
# Output:
#       Name  Age       City
# 0    Alice   25      Delhi
# 1      Bob   30     Mumbai
# 2  Charlie   28  Bangalore
# 3    David   22    Chennai

# 4️⃣ Exploring Data
print("Head:\n", df.head(), "\n")
print("Info:")
df.info()  # info() prints its summary itself and returns None, so no print() is needed
print("\nDescribe:\n", df.describe(), "\n")
print("Columns:", df.columns.tolist(), "\n")
print("Shape:", df.shape, "\n")

# 5️⃣ Selecting Data
print("Select Column:\n", df['Name'], "\n")
# Output:
# 0      Alice
# 1        Bob
# 2    Charlie
# 3      David

print("Select Multiple Columns:\n", df[['Name', 'City']], "\n")

print("Select by Label (loc):\n", df.loc[0], "\n")
# Output:
# Name    Alice
# Age        25
# City    Delhi
# Name: 0, dtype: object

print("Select by Position (iloc):\n", df.iloc[1], "\n")

# Conditional selection
print("Age > 25:\n", df[df['Age'] > 25], "\n")

# 6️⃣ Add, Modify, Delete Columns
df['Country'] = 'India'
df['Age'] = df['Age'] + 1
df.drop('City', axis=1, inplace=True)
print("After Modifications:\n", df, "\n")
# Output:
#       Name  Age  Country
# 0    Alice   26    India
# 1      Bob   31    India
# 2  Charlie   29    India
# 3    David   23    India

# 7️⃣ Handling Missing Values
df.loc[2, 'Age'] = None
print("With Missing Value:\n", df, "\n")
print("Check NaN:\n", df.isnull(), "\n")
print("Fill Missing:\n", df.fillna(0), "\n")
print("Drop Missing:\n", df.dropna(), "\n")

# 8️⃣ Aggregation and Statistics
print("Mean Age:", df['Age'].mean())
print("Sum Age:", df['Age'].sum(), "\n")
# Output:
# Mean Age: 26.666666666666668  (NaN values are skipped)
# Sum Age: 80.0

# GroupBy example
group_data = {
    'City': ['Delhi', 'Delhi', 'Mumbai', 'Mumbai'],
    'Sales': [200, 250, 300, 400]
}
sales_df = pd.DataFrame(group_data)
print("Group By City (Mean Sales):\n", sales_df.groupby('City')['Sales'].mean(), "\n")
# Output:
# City
# Delhi     225.0
# Mumbai    350.0
# Name: Sales, dtype: float64

# 9️⃣ Sorting & Filtering
print("Sorted by Sales Desc:\n", sales_df.sort_values('Sales', ascending=False), "\n")
print("Filter with condition (Sales > 250):\n", sales_df[sales_df['Sales'] > 250], "\n")

# 🔟 Merging & Joining DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'id': [1, 2, 3], 'Salary': [50000, 60000, 55000]})
merged = pd.merge(df1, df2, on='id', how='inner')
print("Merged DataFrame:\n", merged, "\n")
# Output:
#    id Name  Salary
# 0   1    A   50000
# 1   2    B   60000
# 2   3    C   55000

# 1️⃣1️⃣ Concatenating
concat_df = pd.concat([df1, df2], axis=1)
print("Concatenated DataFrame:\n", concat_df, "\n")
# Output (side by side):
#    id Name  id  Salary
# 0   1    A   1   50000
# 1   2    B   2   60000
# 2   3    C   3   55000

# 1️⃣2️⃣ Working with Dates
date_data = pd.DataFrame({
    'Date': ['2024-01-01', '2024-06-15', '2024-10-05']
})
date_data['Date'] = pd.to_datetime(date_data['Date'])
date_data['Year'] = date_data['Date'].dt.year
date_data['Month'] = date_data['Date'].dt.month
print("Date Operations:\n", date_data, "\n")
# Output:
#         Date  Year  Month
# 0 2024-01-01  2024      1
# 1 2024-06-15  2024      6
# 2 2024-10-05  2024     10

# 1️⃣3️⃣ Applying Functions
df = pd.DataFrame({'Age': [15, 22, 35, 45]})
df['AgeGroup'] = df['Age'].apply(lambda x: 'Adult' if x >= 18 else 'Minor')
print("Apply Function Example:\n", df, "\n")
# Output:
#    Age AgeGroup
# 0   15    Minor
# 1   22    Adult
# 2   35    Adult
# 3   45    Adult

# 1️⃣4️⃣ Pivot Table
pivot_df = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'East'],
    'Sales': [200, 300, 400, 250]
})
pivot = pd.pivot_table(pivot_df, values='Sales', index='Region', aggfunc='sum')
print("Pivot Table:\n", pivot, "\n")
# Output:
#         Sales
# Region       
# East      250
# North     600
# South     300

# 1️⃣5️⃣ Useful Functions
demo_df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Bangalore'],
    'Age': [25, 30, 25, 40]
})
print("Value Counts:\n", demo_df['City'].value_counts(), "\n")
print("Check Duplicates:\n", demo_df.duplicated(), "\n")
print("Drop Duplicates:\n", demo_df.drop_duplicates(), "\n")
print("Rename Column:\n", demo_df.rename(columns={'City': 'Location'}), "\n")
print("Random Sample:\n", demo_df.sample(n=2), "\n")

# 1️⃣6️⃣ Export Data
# demo_df.to_csv('final_output.csv', index=False)
# print("Data exported successfully!")
  2. NumPy – Python for Data Science

Introduction

NumPy (Numerical Python) is a fundamental library in Python for scientific computing and data analysis.

It provides:

  • ndarray → N-dimensional array for storing numbers

  • Fast operations on arrays (vectorized computations)

  • Mathematical, statistical, and linear algebra functions

NumPy is the foundation for Pandas, SciPy, and Scikit-learn, and Machine Learning frameworks such as TensorFlow are designed to interoperate with NumPy arrays.


1. Installation & Import

pip install numpy         # run this in a terminal
import numpy as np        # run this in Python

Output: none. The import statement only loads the library.


2. NumPy Arrays

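NumPy arrays look like Python lists, but every element has the same data type and the values are stored in a compact block of memory. That makes them faster and lets them support vectorized operations.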

Create 1D Array

arr = np.array([1, 2, 3, 4])
print(arr)

Output:

[1 2 3 4]

Create 2D Array

arr2d = np.array([[1,2,3],[4,5,6]])
print(arr2d)

Output:

[[1 2 3]
 [4 5 6]]
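
To make the comparison with Python lists concrete, here is a short sketch showing how the same * operator behaves on a list and on an array.

```python
import numpy as np

nums = [1, 2, 3, 4]
arr = np.array(nums)

print(nums * 2)                 # [1, 2, 3, 4, 1, 2, 3, 4]  (the list is repeated)
print(arr * 2)                  # [2 4 6 8]                 (each element is doubled)

# Doubling a list needs an explicit loop or comprehension
print([n * 2 for n in nums])    # [2, 4, 6, 8]
```

The array version is what "vectorized" means: the operation is applied to every element without writing a Python loop.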

3. Array Attributes

print(arr.shape)    # Shape of array
print(arr2d.shape)
print(arr.ndim)     # Number of dimensions
print(arr2d.ndim)
print(arr.dtype)    # Data type

Output:

(4,)
(2, 3)
1
2
int64    (platform-dependent; on Windows with NumPy versions before 2.0 this is int32)

4. Creating Arrays with Built-in Functions

np.zeros(5)          # Array of zeros
np.ones((2,3))       # Array of ones
np.arange(0,10,2)    # Numbers from 0 up to (but not including) 10 with step 2
np.linspace(0,1,5)   # 5 numbers evenly spaced between 0 and 1
np.eye(3)            # Identity matrix

Output Examples:

np.zeros(5) → [0. 0. 0. 0. 0.]
np.ones((2,3)) →
[[1. 1. 1.]
 [1. 1. 1.]]
np.arange(0,10,2) → [0 2 4 6 8]
np.linspace(0,1,5) → [0.   0.25 0.5  0.75 1.  ]
np.eye(3) →
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

5. Indexing & Slicing

1D Array

arr = np.array([10,20,30,40,50])
print(arr[0])       # First element
print(arr[1:4])     # Slice

Output:

10
[20 30 40]

2D Array

arr2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(arr2d[0,1])   # Row 0, Column 1
print(arr2d[1,:])   # Row 1, all columns
print(arr2d[:,2])   # All rows, column 2

Output:

2
[4 5 6]
[3 6 9]

6. Array Operations

NumPy supports element-wise operations:

a = np.array([1,2,3])
b = np.array([4,5,6])

print(a+b)   # [5 7 9]
print(a-b)   # [-3 -3 -3]
print(a*b)   # [4 10 18]
print(a/b)   # [0.25 0.4 0.5]
print(a**2)  # [1 4 9]

7. Universal Functions (ufunc)

NumPy provides fast mathematical functions:

arr = np.array([1,4,9,16])

print(np.sqrt(arr))    # [1. 2. 3. 4.]
print(np.exp(arr))     # Exponentials
print(np.log(arr))     # Natural log
print(np.sin(arr))     # Trigonometric functions

8. Aggregation Functions

arr = np.array([1,2,3,4,5])
print(arr.sum())       # 15
print(arr.mean())      # 3.0
print(arr.std())       # 1.4142... (population standard deviation, ddof=0)
print(arr.min())       # 1
print(arr.max())       # 5
print(arr.argmin())    # Index of min → 0
print(arr.argmax())    # Index of max → 4
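
On 2D arrays the same functions accept an axis argument, and std() accepts ddof. The sketch below is a small illustration of both; note that Pandas' std() defaults to ddof=1, so the two libraries give different numbers unless you set it explicitly.

```python
import numpy as np

arr2d = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(arr2d.sum())                   # 21 (all elements)
print(arr2d.sum(axis=0).tolist())    # [5, 7, 9]   (down each column)
print(arr2d.sum(axis=1).tolist())    # [6, 15]     (across each row)
print(arr2d.mean(axis=1).tolist())   # [2.0, 5.0]

arr = np.array([1, 2, 3, 4, 5])
print(round(float(arr.std()), 4))         # 1.4142 (population, ddof=0)
print(round(float(arr.std(ddof=1)), 4))   # 1.5811 (sample, ddof=1)
```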

9. Reshaping Arrays

arr = np.arange(1,13)
print(arr.reshape(3,4))

Output:

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
  • Flatten back: arr.reshape(3,4).ravel()
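
Two details are worth a quick sketch here: passing -1 lets NumPy work out one dimension for you, and ravel() and flatten() differ in whether they share memory with the original. This assumes a contiguous array such as the one np.arange produces.

```python
import numpy as np

arr = np.arange(1, 13)
m = arr.reshape(3, -1)      # -1 tells NumPy to infer this dimension (4 here)
print(m.shape)              # (3, 4)

flat_copy = m.flatten()     # flatten() always returns a copy
flat_view = m.ravel()       # ravel() returns a view when it can

flat_copy[0] = -1
print(m[0, 0])              # 1  (the copy is independent)

flat_view[0] = 99
print(m[0, 0])              # 99 (the view writes through to m)
print(arr[0])               # 99 (reshape also gave a view of arr)
```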

10. Stacking Arrays

a = np.array([1,2,3])
b = np.array([4,5,6])

np.vstack((a,b))  # Vertical stack
np.hstack((a,b))  # Horizontal stack

Output:

vstack →
[[1 2 3]
 [4 5 6]]

hstack → [1 2 3 4 5 6]

11. Boolean Indexing

arr = np.array([10,20,30,40,50])
print(arr[arr>25])   # [30 40 50]

Select elements based on conditions.
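
The comparison arr>25 produces a boolean mask, and masks can be combined just like conditions in Pandas. A minimal sketch, with np.where added as one way to replace values instead of filtering them:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

mask = (arr > 15) & (arr < 45)      # each condition in its own parentheses
print(mask.tolist())                # [False, True, True, True, False]
print(arr[mask].tolist())           # [20, 30, 40]

# np.where(condition, value_if_true, value_if_false) keeps the array length
capped = np.where(arr > 25, 25, arr)
print(capped.tolist())              # [10, 20, 25, 25, 25]
```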


12. Copy vs View

arr = np.array([1,2,3,4])
arr_view = arr.view()
arr_copy = arr.copy()

arr_view[0] = 100
arr_copy[1] = 200

print(arr)       # [100   2   3   4]  (changed through the view, not by the copy)
print(arr_view)  # [100   2   3   4]
print(arr_copy)  # [  1 200   3   4]  (independent of arr)
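
Views also appear without calling view() explicitly. As a sketch: basic slicing returns a view, while indexing with a list of positions returns a copy. np.shares_memory is used here to check which is which.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

part = arr[1:3]          # basic slicing returns a view
part[0] = 99
print(arr.tolist())      # [1, 99, 3, 4]  (arr changed through the slice)

picked = arr[[0, 1]]     # indexing with a list returns a copy
picked[0] = -5
print(arr.tolist())      # [1, 99, 3, 4]  (arr is untouched this time)

print(np.shares_memory(arr, part))     # True
print(np.shares_memory(arr, picked))   # False
```

When you need an independent slice, call .copy() on it.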

13. Random Numbers

np.random.seed(0)                     # Fix the seed so results are reproducible
print(np.random.randint(0,10,5))      # 5 random integers from 0 to 9
print(np.random.rand(3,3))            # Uniform random floats in [0, 1)
print(np.random.randn(3,3))           # Samples from the standard normal distribution
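
The functions above belong to NumPy's older random interface. Newer code often uses a Generator object created with np.random.default_rng instead. The sketch below shows the rough equivalents; the seed value 0 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)        # a seeded, independent generator

ints = rng.integers(0, 10, size=5)         # integers from 0 to 9
uniform = rng.random((3, 3))               # uniform floats in [0, 1)
normal = rng.standard_normal((3, 3))       # standard normal samples

print(ints.shape, uniform.shape, normal.shape)   # (5,) (3, 3) (3, 3)

# The same seed reproduces the same numbers
rng_again = np.random.default_rng(seed=0)
print(np.array_equal(ints, rng_again.integers(0, 10, size=5)))   # True
```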

14. Linear Algebra

A = np.array([[1,2],[3,4]])
B = np.array([[5,6],[7,8]])

print(np.dot(A,B))      # Matrix multiplication (same as A @ B) → [[19 22] [43 50]]
print(np.linalg.inv(A)) # Inverse → [[-2. 1.] [1.5 -0.5]]
print(np.linalg.det(A)) # Determinant → approximately -2.0
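
A quick way to check these results is to multiply a matrix by its inverse and compare with the identity matrix. The sketch below also uses np.linalg.solve for a linear system; the vector b is made up so that the solution is [1, 2].

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# The @ operator is matrix multiplication, equivalent to np.dot for 2D arrays
print(np.array_equal(A @ B, np.dot(A, B)))     # True
print((A @ B).tolist())                        # [[19, 22], [43, 50]]

# A times its inverse should be the identity, up to floating-point rounding
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))       # True

# Solve A x = b directly instead of computing the inverse first
b = np.array([5, 11])
x = np.linalg.solve(A, b)
print(np.allclose(x, [1, 2]))                  # True
```

np.allclose is used instead of == because floating-point results are rarely exact.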

15. Tips for Beginners

  • NumPy arrays are faster than Python lists for numeric operations.

  • Always try vectorized operations instead of loops.

  • Use reshape, ravel, and flatten to adjust dimensions.

  • Boolean indexing is very powerful for filtering data.


Conclusion

NumPy is the foundation of Python data science. Once you master it, you can perform fast numerical computations, array manipulations, and linear algebra operations with ease.
