🐍 Python for Data Analysis Tutorial

Personal reference for Python in academic research

Getting Started with Python

Installation

Recommended: Anaconda Distribution

• Includes Python, Jupyter, and scientific libraries

• Available at: anaconda.com

• Alternative: Python.org for basic installation

Essential Libraries

Data Analysis: pandas, numpy

Visualization: matplotlib, seaborn

Scientific Computing: scipy, statsmodels

Machine Learning: scikit-learn
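If you're not using Anaconda, the libraries above can be installed from PyPI in one command (these are the standard package names):

```shell
pip install pandas numpy matplotlib seaborn scipy statsmodels scikit-learn
```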

Development Environment

Jupyter Notebooks: Interactive development

VS Code: Full-featured IDE

PyCharm: Professional Python IDE

Spyder: Scientific Python IDE

Learning Resources

Documentation: python.org/doc

Tutorials: Real Python, Python.org

Practice: Kaggle, HackerRank

Books: "Python for Data Analysis" by Wes McKinney

Python Basics for Research

Variables and Data Types

# Basic data types
name = "John Doe"           # String
age = 25                    # Integer
gpa = 3.85                  # Float
is_student = True           # Boolean

# Lists and dictionaries
scores = [85, 92, 78, 96]   # List
student = {                 # Dictionary
    "name": "John",
    "age": 25,
    "gpa": 3.85
}

Control Structures

# If statements
if gpa >= 3.5:
    print("Dean's List")
elif gpa >= 3.0:
    print("Good Standing")
else:
    print("Needs Improvement")

# Loops
for score in scores:
    print(f"Score: {score}")

# While loop
count = 0
while count < 5:
    print(count)
    count += 1

Functions

# Function definition
def calculate_average(numbers):
    """Calculate average of a list of numbers"""
    return sum(numbers) / len(numbers)

# Function with default parameters
def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"

# Using functions
avg_score = calculate_average(scores)
message = greet("Alice", "Hi")

File Operations

# Reading files
with open('data.txt', 'r') as file:
    content = file.read()

# Writing files
with open('results.txt', 'w') as file:
    file.write("Analysis Results\n")
    file.write(f"Average: {avg_score}")

# Working with CSV
import csv
with open('data.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

Data Analysis with Pandas

Creating DataFrames

import pandas as pd

# From dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [85, 92, 78]
}
df = pd.DataFrame(data)

# From CSV file
df = pd.read_csv('data.csv')

# From Excel file (requires an engine such as openpyxl)
df = pd.read_excel('data.xlsx')

Data Exploration

# Basic info
print(df.head())        # First 5 rows
print(df.tail())        # Last 5 rows
print(df.info())        # Data types and info
print(df.describe())    # Summary statistics

# Shape and columns
print(df.shape)         # (rows, columns)
print(df.columns)       # Column names
print(df.dtypes)        # Data types

Data Manipulation

# Selecting data
df['Name']              # Single column
df[['Name', 'Age']]     # Multiple columns
df.loc[0]               # Row by label (here, the label 0)
df.iloc[0]              # Row by position
df.loc[df['Age'] > 30]  # Conditional selection

# Adding columns
df['Grade'] = df['Score'] / 10
df['Pass'] = df['Score'] >= 80

# Grouping and aggregation
df.groupby('Pass')['Score'].mean()
df.groupby('Grade').size()

Data Cleaning

# Missing values
df.isnull().sum()       # Count missing values
df.dropna()             # Drop rows with NaN (returns a copy)
df.fillna(0)            # Fill NaN with 0 (returns a copy)

# Duplicates
df.duplicated().sum()   # Count duplicates
df.drop_duplicates()    # Remove duplicates

# Data types
df['Age'] = df['Age'].astype(int)
df['Score'] = pd.to_numeric(df['Score'])

Data Visualization

Matplotlib Basics

import matplotlib.pyplot as plt

# Basic plot
plt.figure(figsize=(10, 6))
plt.plot(df['Age'], df['Score'])
plt.xlabel('Age')
plt.ylabel('Score')
plt.title('Age vs Score')
plt.show()

# Scatter plot
plt.scatter(df['Age'], df['Score'])
plt.xlabel('Age')
plt.ylabel('Score')
plt.show()

Seaborn Visualizations

import seaborn as sns

# Distribution plot
sns.histplot(df['Score'])
plt.show()

# Box plot
sns.boxplot(x='Pass', y='Score', data=df)
plt.show()

# Correlation heatmap
corr = df.corr(numeric_only=True)   # numeric columns only
sns.heatmap(corr, annot=True)
plt.show()

Statistical Analysis

from scipy import stats

# Descriptive statistics
mean_score = df['Score'].mean()
median_score = df['Score'].median()
std_score = df['Score'].std()

# T-test
t_stat, p_value = stats.ttest_1samp(df['Score'], 80)

# Correlation
correlation, p_val = stats.pearsonr(df['Age'], df['Score'])

print(f"Mean: {mean_score:.2f}")
print(f"Correlation: {correlation:.3f}")

Export Results

# Save DataFrame
df.to_csv('results.csv', index=False)
df.to_excel('results.xlsx', index=False)

# Save plot
plt.figure(figsize=(10, 6))
plt.plot(df['Age'], df['Score'])
plt.savefig('age_score_plot.png', dpi=300)
plt.show()

# Generate report
summary = df.describe()
summary.to_csv('summary_stats.csv')

Research Workflow Tips

Project Organization

  • Create separate folders for data, scripts, and results
  • Use descriptive file names
  • Keep raw data separate from processed data
  • Document your code with comments
  • Use version control (Git)
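As a concrete starting point, the layout above can be created in one command (the project and folder names here are just an example):

```shell
# Example skeleton: raw and processed data kept apart, scripts and results separate
mkdir -p my_analysis/data/raw my_analysis/data/processed my_analysis/scripts my_analysis/results
```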

Code Best Practices

  • Write readable, well-commented code
  • Use meaningful variable names
  • Break complex operations into functions
  • Test your code with sample data
  • Handle errors gracefully
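For instance, the "handle errors gracefully" point might look like this for file loading (load_scores is a hypothetical helper, not part of the tutorial code above):

```python
# Sketch: wrap risky I/O in try/except so one bad file
# doesn't crash the whole analysis
def load_scores(path):
    """Return a list of float scores from a text file, or [] on failure."""
    try:
        with open(path) as f:
            return [float(line) for line in f if line.strip()]
    except (FileNotFoundError, ValueError) as e:
        print(f"Could not load {path}: {e}")
        return []

print(load_scores("missing.txt"))  # prints a warning, then []
```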

Reproducibility

  • Set random seeds for reproducible results
  • Document your environment (requirements.txt)
  • Keep detailed analysis logs
  • Share code and data when possible
  • Use Jupyter notebooks for exploration
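Setting seeds is a one-time step at the top of a script; with both the random module and NumPy seeded, any sampling below repeats exactly on every run:

```python
import random

import numpy as np

# Fix both generators once, near the top of the script
random.seed(42)
np.random.seed(42)

# Any sampling after this point is reproducible
sample = np.random.normal(loc=80, scale=10, size=5)
print(sample)
```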

Common Pitfalls

  • Not backing up your work
  • Overwriting original data
  • Not documenting analysis steps
  • Ignoring data quality issues
  • Not validating results
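Several of these pitfalls can be caught cheaply with a few assertions after cleaning; a sketch using the same small DataFrame as the earlier sections:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [85, 92, 78],
})

# Fail loudly if the cleaned data violates basic expectations
assert df['Score'].between(0, 100).all(), "Score out of range"
assert df['Age'].gt(0).all(), "Non-positive age"
assert not df.duplicated().any(), "Unexpected duplicate rows"
print("All sanity checks passed")
```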