Pandas - Mati & Caleb
This lesson introduces students to the Pandas library in Python for data analysis and manipulation, covering topics such as data loading, table creation, manipulation, and visualization using real-world examples.
- before we start this portion of the lesson:
- Overview:
- Learning Objectives:
- what is pandas?
- Question #2 & 3:
- but why is pandas useful?
- Question #4:
- how do i flipping use it? its so hard, my puny brain cant understand it
- example code on how to load a csv into a chart
- how to manipulate the data in pandas.
- how do i put it into a chart 😩
- Hacks
before we start this portion of the lesson:
check if you have pip installed, since we are going to be installing some libraries today! if you aren't sure whether you have pip, check by running this command:
pip
if your terminal says "command not found" or something else on linux, run this:
python3 -m ensurepip --default-pip
Overview:
Pandas is a powerful tool in Python that is used for data analysis and manipulation. In this lesson, we will explore how to use Pandas to work with datasets, analyze them, and visualize the results.
Learning Objectives:
By the end of this lesson, students should be able to:
- Understand what Pandas is and why it is useful for data analysis
- Load data into Pandas and create tables to store it
- Use different functions in Pandas to manipulate data, such as filtering, sorting, and grouping
- Visualize data using graphs and charts
what is pandas?
this:
- Pandas is a Python library used for data analysis and manipulation.
- it can handle different types of data, including CSV files and databases.
- it also allows you to create tables to store and work with your data.
- it has functions for filtering, sorting, and grouping data to make it easier to work with.
- it also has tools for visualizing data with graphs and charts.
- it is widely used in the industry for data analysis and is a valuable skill to learn.
- companies that use Pandas include JPMorgan Chase, Google, NASA, the New York Times, and many others.
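To make the "tables" idea concrete, here is a minimal sketch that builds a small table by hand. The names and ages are made up for illustration and are not taken from the lesson's CSV file.

```python
import pandas as pd

# made-up rows, purely for illustration
people = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Age": [25, 34, 41],
})

print(people)                # the whole table
print(people.shape)          # (number of rows, number of columns)
print(people["Age"].mean())  # average of one column
```

A table like this is called a DataFrame, and it is the same kind of object that `pd.read_csv` gives back.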
but why is pandas useful?
- it provides tools for handling and manipulating tabular data, which is a common format for storing and analyzing data.
- it can handle different types of data, including CSV files and databases.
- it allows you to perform tasks such as filtering, sorting, and grouping data, making it easier to analyze and work with.
- it has functions for handling missing data and can fill in or remove missing values, which is important for accurate data analysis.
- it also has tools for creating visualizations such as graphs and charts, making it easier to communicate insights from the data.
- it is fast and efficient for datasets that fit in your computer's memory, which is important for time-critical data analysis.
- it is widely used in the industry and has a large community of users and developers, making it easy to find support and resources.
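The missing-data point in the list above can be sketched like this. The table is made up, and filling with the column mean is only one possible choice, not the only correct one.

```python
import numpy as np
import pandas as pd

# made-up table with one missing age
df_demo = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Age": [25.0, np.nan, 41.0],
})

# option 1: fill the gap with the mean of the known ages
filled = df_demo.fillna({"Age": df_demo["Age"].mean()})

# option 2: drop any row that has a missing value
dropped = df_demo.dropna()

print(filled)
print(dropped)
```

Neither call changes `df_demo` itself; each one returns a new table.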
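import pandas as pd
# load the CSV file into a DataFrame
df = pd.read_csv('yourcsvfileidcjustpickoneidiot.csv')
# print the first five rows
print(df.head())
# calculate the average of the Age column
print("Average age:", df['Age'].mean())
# keep only the rows where Gender is 'Female'
females = df[df['Gender'] == 'Female']
print(females)
# sort by Salary, highest first
sorted_data = df.sort_values(by='Salary', ascending=False)
print(sorted_data)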
uh oh!!! no pandas 😢
if you see this error, enter these into your terminal:
pip install wheel
pip install pandas
on stack overflow, it said pandas is distributed through pip as a wheel, so you need that too.
link to full forum if curious: https://stackoverflow.com/questions/33481974/importerror-no-module-named-pandas
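If you want to check from inside Python whether a library is installed, here is a minimal sketch. The helper name `is_installed` is made up for this example; `importlib.util.find_spec` is part of the standard library.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a module with this name."""
    return importlib.util.find_spec(module_name) is not None

# report on the two libraries this lesson uses
for name in ["pandas", "matplotlib"]:
    if is_installed(name):
        print(name, "-> installed")
    else:
        print(name, "-> missing, try: pip install", name)
```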
ps: do this for this to work on ur laptop:
wget https://raw.githubusercontent.com/KKcbal/amongus/master/_notebooks/files/example.csv
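If downloading the file is not an option, a tiny stand-in can be built in memory. This is only a sketch: the column names (Name, Age, Gender, Income) are guessed from the code in this lesson and the rows are made up, so the real example.csv may look different.

```python
import io
import pandas as pd

# made-up stand-in for example.csv; the columns are an assumption
csv_text = """Name,Age,Gender,Income
Ana,25,Female,40000
Ben,34,Male,52000
Cy,41,Male,61000
Dee,58,Female,75000
"""

# read_csv accepts a file-like object as well as a file path
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
print(df.dtypes)
```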
import pandas as pd
# read the CSV file
df = pd.read_csv('/files/example.csv')
# print the first five rows
print(df.head())
# define a function to assign each age to an age group
def assign_age_group(age):
    if age < 30:
        return '<30'
    elif age < 40:
        return '30-40'
    elif age < 50:
        return '40-50'
    else:
        return '>50'
# apply the function to the Age column to create a new column with age groups
df['Age Group'] = df['Age'].apply(assign_age_group)
# group by age group and count the number of people in each group
age_counts = df.groupby('Age Group')['Name'].count()
# print the age group counts
print(age_counts)
import pandas as pd
# load the csv file
df = pd.read_csv('example.csv')
# print the first five rows
print(df.head())
# filter the data to include only people aged 30 or older
df_filtered = df[df['Age'] >= 30]
# sort the data by age in descending order
df_sorted = df.sort_values('Age', ascending=False)
# group the data by gender and calculate the mean age for each group
age_by_gender = df.groupby('Gender')['Age'].mean()
# print the filtered data
print(df_filtered)
# print the sorted data
print(df_sorted)
# print the mean age by gender
print(age_by_gender)
import pandas as pd
import matplotlib.pyplot as plt
# read the CSV file
df = pd.read_csv('example.csv')
# create a bar chart of the number of people in each age group
age_groups = ['<30', '30-40', '40-50', '>50']
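# the open-ended top bin keeps the bin edges increasing even if nobody is older than 50; sort_index keeps the bars in age order
age_counts = pd.cut(df['Age'], bins=[0, 30, 40, 50, float('inf')], labels=age_groups, include_lowest=True).value_counts().sort_index()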
plt.bar(age_counts.index, age_counts.values)
plt.title('Number of people in each age group')
plt.xlabel('Age group')
plt.ylabel('Number of people')
plt.show()
# create a pie chart of the gender distribution
gender_counts = df['Gender'].value_counts()
plt.pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%')
plt.title('Gender distribution')
plt.show()
# create a scatter plot of age vs. income
plt.scatter(df['Age'], df['Income'])
plt.title('Age vs. Income')
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()
uh oh!!!! another error!??!!??!?! install this library:
pip install matplotlib
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# read the CSV file
df = pd.read_csv('example.csv')
# define age groups
age_groups = ['<30', '30-40', '40-50', '>50']
# create a new column with the age group for each person
df['Age Group'] = pd.cut(df['Age'], bins=[0, 30, 40, 50, np.inf], labels=age_groups, include_lowest=True)
# group by age group and count the number of people in each group
age_counts = df.groupby('Age Group')['Name'].count()
# create a bar chart of the age counts
age_counts.plot(kind='bar')
# set the title and axis labels
plt.title('Number of People in Each Age Group')
plt.xlabel('Age Group')
plt.ylabel('Number of People')
# show the chart
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
print("Here is my soccer dataframe:")
df = pd.read_csv('soccer.csv')
print(df)
# Filtering only the players that are 30 or older
df_filtered = df[df['Age'] >= 30]
# print(df_filtered)
print("")
print("Here are the players from youngest to oldest")
df_sorted = df.sort_values('Age', ascending=True)
print(df_sorted)
plt.figure(figsize=(8, 6))
plt.bar(df['Player'], df['Age'])
plt.xlabel('Player')
plt.ylabel('Age')
plt.title('Age of Soccer Players')
plt.xticks(rotation=90)
plt.show()
Questions
- What are the two primary data structures in pandas and how do they differ?
- The two primary data structures in pandas are Series and DataFrame. A Series is a one-dimensional array-like object that can hold any data type. A DataFrame is a two-dimensional table-like data structure with rows and columns, similar to a spreadsheet.
- How do you read a CSV file into a pandas DataFrame?
- To read a CSV file into a pandas DataFrame, you can use the read_csv function in pandas.
- How do you select a single column from a pandas DataFrame?
- To select a single column from a pandas DataFrame, you can use the indexing operator [] with the column name, for example df['Age'].
- How do you filter rows in a pandas DataFrame based on a condition?
- To filter rows in a pandas DataFrame based on a condition, you can use boolean indexing.
- How do you group rows in a pandas DataFrame by a particular column?
- To group rows in a pandas DataFrame by a particular column, you can use the groupby method.
- How do you aggregate data in a pandas DataFrame using functions like sum and mean?
- To aggregate data in a pandas DataFrame using functions like sum and mean, you can use the agg method.
- How do you handle missing values in a pandas DataFrame?
- To handle missing values in a pandas DataFrame, you can use the fillna method to fill in missing values with a specific value or method, or you can use the dropna method to remove rows with missing values.
- How do you merge two pandas DataFrames together?
- To merge two pandas DataFrames together, you can use the merge method.
- How do you export a pandas DataFrame to a CSV file?
- To export a pandas DataFrame to a CSV file, you can use the to_csv method.
- What is the difference between a Series and a DataFrame in Pandas?
- The main difference between a Series and a DataFrame in pandas is that a Series is a one-dimensional array-like object, while a DataFrame is a two-dimensional table-like data structure. A Series can be thought of as a single column of a DataFrame, while a DataFrame can have multiple columns.
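Several of the answers above can be tried out in one short sketch. Both tables, including the team and city names, are made up for illustration.

```python
import pandas as pd

# made-up tables, for illustration only
people = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy", "Dee"],
    "Team": ["Red", "Blue", "Red", "Blue"],
    "Age": [25, 34, 41, 58],
})
teams = pd.DataFrame({"Team": ["Red", "Blue"], "City": ["Austin", "Boston"]})

ages = people["Age"]                    # one column -> a Series
over_30 = people[people["Age"] > 30]    # boolean indexing -> a DataFrame
stats = people.groupby("Team")["Age"].agg(["sum", "mean"])  # group, then aggregate
merged = people.merge(teams, on="Team")  # add each person's city
csv_text = merged.to_csv(index=False)    # with no path given, to_csv returns a string

print(type(ages).__name__, type(over_30).__name__)
print(stats)
print(csv_text)
```

Passing a file name such as `merged.to_csv('out.csv', index=False)` writes a file instead of returning text; the name `out.csv` is just a placeholder.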
numpy hacks
from skimage import io
photo = io.imread('waldo.png')
type(photo)
import matplotlib.pyplot as plt
plt.imshow(photo)
photo.shape
plt.imshow(photo[100:300, 350:400])
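The crop above works because the loaded image is a NumPy array, so `photo[100:300, 350:400]` means rows 100 to 299 and columns 350 to 399. Here is a minimal sketch of the same slicing on a small made-up array instead of waldo.png:

```python
import numpy as np

# made-up "image": the values 0..19 laid out in 4 rows and 5 columns
grid = np.arange(20).reshape(4, 5)
print(grid)

# rows 1-2 and columns 2-3 (the end index is not included)
crop = grid[1:3, 2:4]
print(crop)
print(crop.shape)
```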
Here's an example of another numpy function that we can use to calculate the mean of an array:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
mean = np.mean(arr)
print(mean)
Data Analysis Hacks
How can Numpy and Pandas be used to preprocess data for predictive analysis?
- Numpy and Pandas can be used to preprocess data for predictive analysis by performing tasks such as data cleaning, normalization, scaling, feature selection, and feature engineering. They transform raw data into a format suitable for machine learning algorithms.
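As a rough sketch of the scaling and normalization steps mentioned above, here are two common recipes written with plain pandas arithmetic. The numbers are made up, and which recipe fits depends on the model you plan to use.

```python
import pandas as pd

# made-up data, for illustration only
df_demo = pd.DataFrame({
    "Age": [20, 30, 40, 60],
    "Income": [30000, 45000, 50000, 90000],
})

# min-max scaling: squeeze each column into the range 0 to 1
min_max = (df_demo - df_demo.min()) / (df_demo.max() - df_demo.min())

# standardization: each column ends up with mean 0 and standard deviation 1
standardized = (df_demo - df_demo.mean()) / df_demo.std()

print(min_max)
print(standardized)
```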
What machine learning algorithms can be used for predictive analysis, and how do they differ?
- Machine learning algorithms that can be used for predictive analysis include linear regression, logistic regression, decision trees, random forests, SVM, KNN, and neural networks. These algorithms differ in terms of their complexity, interpretability, and performance on different types of data.
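Linear regression is the simplest algorithm on that list. A minimal sketch using only NumPy's `polyfit`, on made-up points that roughly follow y = 2x + 1 (real projects usually reach for a dedicated machine learning library instead):

```python
import numpy as np

# made-up points that roughly follow y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# fit a straight line (degree-1 polynomial) by least squares
slope, intercept = np.polyfit(x, y, 1)

# use the fitted line to predict y for a new x value
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```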
Can you discuss some real-world applications of predictive analysis in different industries?
- Predictive analysis can be used in different industries for applications such as fraud detection, customer segmentation, demand forecasting, predictive maintenance, and risk assessment.
Can you explain the role of feature engineering in predictive analysis, and how it can improve model accuracy?
- Feature engineering is the process of creating new features from existing data to improve model accuracy. It can involve transformations such as scaling, normalization, and one-hot encoding.
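One-hot encoding, mentioned above, turns a text column into one column per category. A minimal sketch with made-up data, using `pd.get_dummies`:

```python
import pandas as pd

# made-up data, for illustration only
df_demo = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Gender": ["Female", "Male", "Male"],
})

# one-hot encoding: the Gender column becomes one column per category
encoded = pd.get_dummies(df_demo, columns=["Gender"])
print(encoded)
```

Most models need numbers rather than text labels, which is why this step often comes before training.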
How can machine learning models be deployed in real-time applications for predictive analysis?
- Machine learning models can be deployed in real-time applications for predictive analysis using techniques such as model serving and containerization.
Can you discuss some limitations of Numpy and Pandas, and when it might be necessary to use other data analysis tools?
- Limitations of Numpy and Pandas include their memory requirements for large datasets and their lack of support for distributed computing. Other data analysis tools may be necessary for these cases.
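Before switching tools, one common workaround for the memory limit is to read a big file in pieces. Here is a minimal sketch using the `chunksize` option of `read_csv`; the data is a made-up in-memory string standing in for a file that is too big to load at once.

```python
import io
import pandas as pd

# made-up CSV text standing in for a very large file
csv_text = "Age\n" + "\n".join(str(20 + i) for i in range(10))

total = 0
rows = 0
# chunksize makes read_csv hand back DataFrames of up to 4 rows at a time
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["Age"].sum()
    rows += len(chunk)

print("mean age:", total / rows)
```

Only one chunk is held in memory at a time, so the running totals do the work that a single `df['Age'].mean()` would do on a fully loaded table.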
How can predictive analysis be used to improve decision-making and optimize business processes?
- Predictive analysis can be used to improve decision-making and optimize business processes by providing insights into customer behavior, market trends, and operational performance. It can also help to reduce costs and improve efficiency in different industries.