🐍 Python for Data Analysis Tutorial

Personal reference for Python in academic research

Getting Started with Python

Installation

Recommended: Anaconda Distribution

• Includes Python, Jupyter, and scientific libraries

• Available at: anaconda.com

• Alternative: Python.org for basic installation

Essential Libraries

Data Analysis: pandas, numpy

Visualization: matplotlib, seaborn

Scientific Computing: scipy, statsmodels

Machine Learning: scikit-learn
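If you're not using Anaconda, the libraries above can be installed from PyPI in one command (these are the standard package names):

```shell
pip install pandas numpy matplotlib seaborn scipy statsmodels scikit-learn
```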

Development Environment

Jupyter Notebooks: Interactive development

VS Code: Full-featured IDE

PyCharm: Professional Python IDE

Spyder: Scientific Python IDE

Learning Resources

Documentation: python.org/doc

Tutorials: Real Python, Python.org

Practice: Kaggle, HackerRank

Books: "Python for Data Analysis" by Wes McKinney

Python Basics for Research

Variables and Data Types

# Basic data types
name = "John Doe"           # String
age = 25                    # Integer
gpa = 3.85                  # Float
is_student = True           # Boolean

# Lists and dictionaries
scores = [85, 92, 78, 96]   # List
student = {                 # Dictionary
    "name": "John",
    "age": 25,
    "gpa": 3.85
}

Control Structures

# If statements
if gpa >= 3.5:
    print("Dean's List")
elif gpa >= 3.0:
    print("Good Standing")
else:
    print("Needs Improvement")

# Loops
for score in scores:
    print(f"Score: {score}")

# While loop
count = 0
while count < 5:
    print(count)
    count += 1

Functions

# Function definition
def calculate_average(numbers):
    """Calculate average of a list of numbers"""
    return sum(numbers) / len(numbers)

# Function with default parameters
def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"

# Using functions
avg_score = calculate_average(scores)
message = greet("Alice", "Hi")

File Operations

# Reading files
with open('data.txt', 'r') as file:
    content = file.read()

# Writing files
with open('results.txt', 'w') as file:
    file.write("Analysis Results\n")
    file.write(f"Average: {avg_score}")

# Working with CSV
import csv
with open('data.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

Data Analysis with Pandas

Creating DataFrames

import pandas as pd

# From dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [85, 92, 78]
}
df = pd.DataFrame(data)

# From CSV file
df = pd.read_csv('data.csv')

# From Excel file (requires an engine such as openpyxl)
df = pd.read_excel('data.xlsx')

Data Exploration

# Basic info
print(df.head())        # First 5 rows
print(df.tail())        # Last 5 rows
print(df.info())        # Data types and info
print(df.describe())    # Summary statistics

# Shape and columns
print(df.shape)         # (rows, columns)
print(df.columns)       # Column names
print(df.dtypes)        # Data types

Data Manipulation

# Selecting data
df['Name']              # Single column
df[['Name', 'Age']]     # Multiple columns
df.loc[0]               # Row by label (here, the label 0)
df.iloc[0]              # Row by position
df.loc[df['Age'] > 30]  # Conditional selection

# Adding columns
df['Grade'] = df['Score'] / 10
df['Pass'] = df['Score'] >= 80

# Grouping and aggregation
df.groupby('Pass')['Score'].mean()
df.groupby('Grade').size()

Data Cleaning

# Missing values
df.isnull().sum()       # Count missing values
df.dropna()             # Drop rows with NaN (returns a copy)
df.fillna(0)            # Fill NaN with 0 (returns a copy)

# Duplicates
df.duplicated().sum()   # Count duplicates
df.drop_duplicates()    # Remove duplicates

# Data types
df['Age'] = df['Age'].astype(int)
df['Score'] = pd.to_numeric(df['Score'])

Data Visualization

Matplotlib Basics

import matplotlib.pyplot as plt

# Basic plot
plt.figure(figsize=(10, 6))
plt.plot(df['Age'], df['Score'])
plt.xlabel('Age')
plt.ylabel('Score')
plt.title('Age vs Score')
plt.show()

# Scatter plot
plt.scatter(df['Age'], df['Score'])
plt.xlabel('Age')
plt.ylabel('Score')
plt.show()

Seaborn Visualizations

import seaborn as sns

# Distribution plot
sns.histplot(df['Score'])
plt.show()

# Box plot
sns.boxplot(x='Pass', y='Score', data=df)
plt.show()

# Correlation heatmap
corr = df.corr(numeric_only=True)   # numeric columns only
sns.heatmap(corr, annot=True)
plt.show()

Statistical Analysis

from scipy import stats

# Descriptive statistics
mean_score = df['Score'].mean()
median_score = df['Score'].median()
std_score = df['Score'].std()

# T-test
t_stat, p_value = stats.ttest_1samp(df['Score'], 80)

# Correlation
correlation, p_val = stats.pearsonr(df['Age'], df['Score'])

print(f"Mean: {mean_score:.2f}")
print(f"Correlation: {correlation:.3f}")

Export Results

# Save DataFrame
df.to_csv('results.csv', index=False)
df.to_excel('results.xlsx', index=False)

# Save plot
plt.figure(figsize=(10, 6))
plt.plot(df['Age'], df['Score'])
plt.savefig('age_score_plot.png', dpi=300)
plt.show()

# Generate report
summary = df.describe()
summary.to_csv('summary_stats.csv')

Research Workflow Tips

Project Organization

  • Create separate folders for data, scripts, and results
  • Use descriptive file names
  • Keep raw data separate from processed data
  • Document your code with comments
  • Use version control (Git)
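As a concrete starting point, the layout above can be created in one command (the project and folder names here are just an example):

```shell
# Example skeleton: raw and processed data kept apart, scripts and results separate
mkdir -p my_analysis/data/raw my_analysis/data/processed my_analysis/scripts my_analysis/results
```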

Code Best Practices

  • Write readable, well-commented code
  • Use meaningful variable names
  • Break complex operations into functions
  • Test your code with sample data
  • Handle errors gracefully
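For instance, the "handle errors gracefully" point might look like this for file loading (load_scores is a hypothetical helper, not part of the tutorial code above):

```python
# Sketch: wrap risky I/O in try/except so one bad file
# doesn't crash the whole analysis
def load_scores(path):
    """Return a list of float scores from a text file, or [] on failure."""
    try:
        with open(path) as f:
            return [float(line) for line in f if line.strip()]
    except (FileNotFoundError, ValueError) as e:
        print(f"Could not load {path}: {e}")
        return []

print(load_scores("missing.txt"))  # prints a warning, then []
```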

Reproducibility

  • Set random seeds for reproducible results
  • Document your environment (requirements.txt)
  • Keep detailed analysis logs
  • Share code and data when possible
  • Use Jupyter notebooks for exploration
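Setting seeds is a one-time step at the top of a script; with both the random module and NumPy seeded, any sampling below repeats exactly on every run:

```python
import random

import numpy as np

# Fix both generators once, near the top of the script
random.seed(42)
np.random.seed(42)

# Any sampling after this point is reproducible
sample = np.random.normal(loc=80, scale=10, size=5)
print(sample)
```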

Common Pitfalls

  • Not backing up your work
  • Overwriting original data
  • Not documenting analysis steps
  • Ignoring data quality issues
  • Not validating results
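Several of these pitfalls can be caught cheaply with a few assertions after cleaning; a sketch using the same small DataFrame as the earlier sections:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [85, 92, 78],
})

# Fail loudly if the cleaned data violates basic expectations
assert df['Score'].between(0, 100).all(), "Score out of range"
assert df['Age'].gt(0).all(), "Non-positive age"
assert not df.duplicated().any(), "Unexpected duplicate rows"
print("All sanity checks passed")
```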