    Pandas Basics

    Pandas is one of the most widely used Python libraries for data analysis, and a standard tool in Data Science, Machine Learning (ML), and Artificial Intelligence (AI).

    Pandas helps developers and data scientists work with structured data efficiently.

    The name Pandas is derived from "panel data", an econometrics term for structured, multidimensional datasets.

    Why Pandas is Important in ML

    Machine Learning systems require data preprocessing, cleaning, transformation, and analysis.

    Pandas makes these tasks:

    • Simple
    • Fast
    • Efficient
    • Easy to understand

    Pandas is heavily used for:

    • Data cleaning
    • Data transformation
    • Feature engineering
    • Exploratory Data Analysis (EDA)
    • Handling CSV and Excel files

    Installing Pandas

    Pandas can be installed using pip.

    
    pip install pandas
    

    Importing Pandas

    Pandas is commonly imported using the alias pd.

    
    import pandas as pd
    

    Main Data Structures in Pandas

    Pandas mainly provides:

    1. Series
    2. DataFrame

    What is a Series?

    A Series is a one-dimensional labeled array.

    It can store:

    • Integers
    • Strings
    • Floating-point values
    • Boolean values

    Creating a Series

    
    import pandas as pd
    
    data = [10, 20, 30]
    
    series = pd.Series(data)
    
    print(series)
    

    Output

    
    0    10
    1    20
    2    30
    dtype: int64
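
    The "labeled" part of the definition is easier to see with custom labels. The sketch below assumes made-up labels "a", "b", and "c" in place of the default 0, 1, 2 index.

```python
import pandas as pd

# Hypothetical labels chosen only for illustration.
scores = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(scores["b"])      # access by label
print(scores.iloc[0])   # access by integer position
print(scores * 2)       # arithmetic applies to every element
```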
    

    What is a DataFrame?

    A DataFrame is a two-dimensional table consisting of rows and columns.

    It is the most important structure in Pandas.

    DataFrame Representation

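    A DataFrame can be pictured as a spreadsheet-like table: a row index runs down the left side, and each named column holds values of a single data type.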

    Creating a DataFrame

    
    import pandas as pd
    
    data = {
        "Name": ["John", "Sara"],
        "Age": [25, 30]
    }
    
    df = pd.DataFrame(data)
    
    print(df)
    

    Output

    
       Name  Age
    0  John   25
    1  Sara   30
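
    A few attributes describe the table's structure. A short sketch using the same two-row example:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Sara"], "Age": [25, 30]})

print(df.shape)           # (number of rows, number of columns)
print(list(df.columns))   # column labels
print(df.dtypes)          # one data type per column
```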
    

    Advantages of Pandas

    • Easy data handling
    • Powerful data analysis
    • Efficient processing
    • Works well with NumPy
    • Handles datasets that fit in memory

    Reading CSV Files

    Pandas can read CSV files easily.

    
    df = pd.read_csv("data.csv")
    
    print(df)
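
    To try read_csv() without a file on disk, the text can be wrapped in io.StringIO. This is only a sketch: the CSV contents below are made up and stand in for a file such as "data.csv".

```python
import io
import pandas as pd

# In-memory text standing in for a CSV file (hypothetical contents).
csv_text = "Name,Age\nJohn,25\nSara,30\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```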
    

    Reading Excel Files

    
    # Reading .xlsx files needs an Excel engine such as openpyxl (pip install openpyxl)
    df = pd.read_excel("data.xlsx")
    

    Viewing Data

    Pandas provides functions to inspect data quickly.

    head() Function

    Displays the first five rows by default. Pass a number, such as head(10), to change how many are shown.

    
    print(df.head())
    

    tail() Function

    Displays the last five rows by default. Pass a number, such as tail(10), to change how many are shown.

    
    print(df.tail())
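
    Both functions accept a row count. A small sketch on a made-up ten-row table:

```python
import pandas as pd

df = pd.DataFrame({"value": range(10)})   # hypothetical data: 0 through 9

first_three = df.head(3)   # rows 0, 1, 2
last_two = df.tail(2)      # rows 8, 9

print(first_three)
print(last_two)
```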
    

    Checking Data Information

    info() Function

    
    df.info()  # info() prints its summary directly and returns None
    

    This shows:

    • Column names
    • Data types
    • Non-null count per column, which reveals missing values
    • Memory usage

    Checking Statistical Information

    describe() Function

    
    print(df.describe())
    

    For numeric columns, it provides:

    • Count of non-missing values
    • Mean
    • Standard deviation
    • Minimum and maximum values
    • 25th, 50th (median), and 75th percentiles

    Mean Formula

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

    Here x_1, ..., x_n are the non-missing values in a column and n is their count.
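
    The mean can be checked by hand against the value Pandas reports. The ages below are made-up values for illustration.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 50])   # hypothetical values

manual_mean = ages.sum() / len(ages)   # sum of values divided by their count
print(manual_mean)      # 35.0
print(ages.mean())      # 35.0, computed by Pandas
print(ages.median())    # 32.5, the "50%" row of describe()
```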

    Selecting Columns

    
    print(df["Name"])
    

    Selecting Multiple Columns

    
    print(df[["Name", "Age"]])
    

    Selecting Rows

    Pandas provides two indexers:

    • loc[] selects rows by index label
    • iloc[] selects rows by integer position

    Using loc[]

    
    print(df.loc[0])
    

    Using iloc[]

    
    print(df.iloc[1])
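
    With the default 0, 1, 2 index the two indexers look interchangeable. The difference shows up once the labels are not simple positions. The sketch below assumes a made-up index of 10, 20, 30.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John", "Sara", "Ravi"], "Age": [25, 30, 35]},
    index=[10, 20, 30],   # hypothetical non-default labels
)

by_label = df.loc[20]       # the row whose index label is 20
by_position = df.iloc[0]    # the first row, whatever its label

print(by_label["Name"])     # Sara
print(by_position["Name"])  # John
```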
    

    Filtering Data

    Filtering selects rows based on conditions.

    
    print(df[df["Age"] > 25])
    

    Filtering Condition

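    The expression df["Age"] > 25 compares every value in the column and returns a Boolean Series. Passing that Series back to df[...] keeps only the rows where it is True.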
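
    Several conditions can be combined with & (and) and | (or). A minimal sketch on made-up data; note that each condition needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Sara", "Ravi"], "Age": [25, 30, 35]})

mask = df["Age"] > 25
print(mask.tolist())   # [False, True, True]

# Keep rows that satisfy both conditions.
selected = df[(df["Age"] > 25) & (df["Name"] != "Ravi")]
print(selected)
```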

    Adding New Columns

    
    df["Salary"] = [50000, 60000]
    
    print(df)
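
    A new column can also be computed from existing ones. The "Bonus" column and the 10% rate below are hypothetical, chosen only to show the pattern.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Sara"], "Salary": [50000, 60000]})

# Hypothetical 10% bonus, computed for all rows in one operation.
df["Bonus"] = df["Salary"] * 0.10
print(df)
```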
    

    Updating Data

    
    df.loc[0, "Age"] = 26
    

    Deleting Columns

    
    df = df.drop(columns="Salary")
    

    Handling Missing Values

    Missing data is common in Machine Learning datasets.

    Checking Missing Values

    
    print(df.isnull())
    

    Removing Missing Values

    
    # dropna() returns a new DataFrame, so assign the result to keep it
    df = df.dropna()
    

    Filling Missing Values

    
    # fillna() also returns a new DataFrame
    df = df.fillna(0)
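
    The three steps can be compared side by side. This is a sketch on made-up data; filling "Age" with the column mean and "City" with "Unknown" is one possible choice, not a general recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({"Age": [25, np.nan, 35], "City": ["Pune", "Delhi", None]})

print(df.isnull().sum())   # number of missing values per column

dropped = df.dropna()      # keeps only rows with no missing values
filled = df.fillna({"Age": df["Age"].mean(), "City": "Unknown"})

print(dropped)
print(filled)
```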
    

    Data Cleaning in ML

    Data cleaning generally improves Machine Learning performance, because a model learns from whatever errors its training data contains.

    Common cleaning tasks:

    • Removing duplicates
    • Handling null values
    • Correcting data types
    • Filtering invalid records

    Removing Duplicate Values

    
    # drop_duplicates() returns a new DataFrame, so assign the result to keep it
    df = df.drop_duplicates()
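
    Before removing duplicates it can help to see which rows are flagged. A short sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Sara", "John"], "Age": [25, 30, 25]})

print(df.duplicated().tolist())   # [False, False, True]

deduped = df.drop_duplicates()    # keeps the first occurrence of each row
print(deduped)
```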
    

    Sorting Data

    
    # sort_values() returns a new, sorted DataFrame (ascending by default)
    df = df.sort_values("Age")
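
    Sorting can use several columns and a direction for each. A sketch on made-up data, sorting by "Age" descending and breaking ties by "Name" ascending:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Sara", "Ravi"], "Age": [30, 25, 30]})

sorted_df = df.sort_values(["Age", "Name"], ascending=[False, True])
print(sorted_df)
```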
    

    Grouping Data

    Grouping is useful for aggregation and analysis.

    
    # Assumes df has a "Department" column; numeric_only=True skips text columns
    df.groupby("Department").mean(numeric_only=True)
    

    Aggregation Example

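    \bar{x}_g = \frac{1}{n_g} \sum_{i \in g} x_i

    For each group g, the mean is the sum of the values in that group divided by n_g, the number of rows in the group.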
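
    A worked sketch of grouping, assuming a made-up table with "Department" and "Salary" columns:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "Department": ["HR", "IT", "IT", "HR"],
    "Salary": [40000, 60000, 80000, 50000],
})

# Mean salary per department: HR -> 45000.0, IT -> 70000.0
mean_salary = df.groupby("Department")["Salary"].mean()
print(mean_salary)

# Several aggregations at once.
summary = df.groupby("Department")["Salary"].agg(["mean", "max", "count"])
print(summary)
```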

    Merging DataFrames

    Pandas supports combining datasets.

    
    # Assumes df1 and df2 both have an "ID" column; rows are matched on it
    merged = pd.merge(df1, df2, on="ID")
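
    By default merge() keeps only the keys found in both tables. The how parameter changes that. A sketch with two made-up tables:

```python
import pandas as pd

# Hypothetical tables sharing an "ID" column.
df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["John", "Sara", "Ravi"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "Salary": [60000, 55000, 70000]})

inner = pd.merge(df1, df2, on="ID")              # IDs present in both: 2 and 3
left = pd.merge(df1, df2, on="ID", how="left")   # every row of df1; missing Salary is NaN

print(inner)
print(left)
```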
    

    Concatenating DataFrames

    
    # Stacks the rows of df2 below the rows of df1
    combined = pd.concat([df1, df2])
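
    concat() keeps each table's original index, which can produce repeated labels. A sketch on made-up data showing ignore_index=True as one way to renumber the rows:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["John"], "Age": [25]})
df2 = pd.DataFrame({"Name": ["Sara"], "Age": [30]})

stacked = pd.concat([df1, df2])
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked.index.tolist())      # [0, 0]
print(renumbered.index.tolist())   # [0, 1]
```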
    

    Exporting Data

    Saving to CSV

    
    # index=False leaves the row index out of the file
    df.to_csv("output.csv", index=False)
    

    Saving to Excel

    
    # Writing .xlsx files needs an Excel engine such as openpyxl
    df.to_excel("output.xlsx", index=False)
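
    The effect of the index parameter is easy to inspect without writing a file, since to_csv() returns the CSV text when no path is given. A small sketch:

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Sara"], "Age": [25, 30]})

with_index = df.to_csv()                # first column holds the row index
without_index = df.to_csv(index=False)  # data columns only

print(with_index)
print(without_index)

# Reading the text back gives the same columns and values.
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```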
    

    Pandas with NumPy

    Pandas is built on top of NumPy, so a NumPy array can be passed directly to pd.Series or pd.DataFrame.

    
    import numpy as np
    
    arr = np.array([1, 2, 3])
    
    series = pd.Series(arr)
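
    The conversion works in both directions. A brief sketch:

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])
series = pd.Series(arr)

squared = np.square(series)   # NumPy functions accept a Series and return a Series
back = series.to_numpy()      # convert a Series back to a NumPy array

print(squared)
print(back)
```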
    

    Pandas in Machine Learning

    Pandas is heavily used in:

    • Dataset preprocessing
    • Feature engineering
    • Data visualization preparation
    • Statistical analysis

    Example ML Workflow with Pandas

    
    import pandas as pd
    
    df = pd.read_csv("data.csv")
    
    df = df.dropna()
    
    X = df[["Age", "Salary"]]
    
    y = df["Purchased"]
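
    The same workflow can be run end to end without a file. In this sketch the CSV text is made up and stands in for "data.csv"; one row has a missing Salary so that dropna() has something to remove.

```python
import io
import pandas as pd

# Hypothetical dataset; the second row is missing its Salary.
csv_text = """Age,Salary,Purchased
25,50000,0
30,,1
35,70000,1
"""

df = pd.read_csv(io.StringIO(csv_text))
df = df.dropna()             # drop rows with missing values

X = df[["Age", "Salary"]]    # feature columns
y = df["Purchased"]          # target column

print(X.shape, y.shape)      # (2, 2) (2,)
```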
    

    Advantages of Pandas in ML

    • Easy dataset handling
    • Efficient preprocessing
    • Handles datasets that fit in memory
    • Powerful analysis tools

    Limitations of Pandas

    • Memory intensive for huge datasets
    • Can be slower than specialized big data tools
    • Learning curve for beginners

    Best Practices

    • Use vectorized operations
    • Avoid unnecessary loops
    • Handle missing values carefully
    • Use meaningful column names
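
    The first two points can be illustrated with a small comparison. The column names "price" and "quantity" are hypothetical. Both versions produce the same totals, but the vectorized form is shorter and usually much faster on large tables.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# Loop version: processes one row at a time.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["quantity"])

# Vectorized version: one expression over whole columns.
df["total"] = df["price"] * df["quantity"]

print(totals_loop)
print(df["total"].tolist())
```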

    Real-World Example

    In an e-commerce recommendation system, Pandas helps:

    • Analyze customer data
    • Process transaction records
    • Prepare features for ML models
    • Generate business insights

    Future of Pandas

    Pandas remains one of the most important data analysis tools in Python.

    Future improvements may include:

    • Better performance optimization
    • Enhanced cloud integration
    • Improved big data support
    • Advanced analytics features

    Conclusion

    Pandas is an essential library for Data Science and Machine Learning.

    It helps developers:

    • Handle datasets efficiently
    • Clean and preprocess data
    • Perform analysis easily
    • Build powerful ML pipelines

    Mastering Pandas is a critical step toward becoming a successful Data Scientist or Machine Learning engineer.