Getting Started with Python
Installation
Recommended: Anaconda Distribution
• Includes Python, Jupyter, and scientific libraries
• Available at: anaconda.com
• Alternative: Python.org for basic installation
Essential Libraries
Data Analysis: pandas, numpy
Visualization: matplotlib, seaborn
Scientific Computing: scipy, statsmodels
Machine Learning: scikit-learn
Development Environment
Jupyter Notebooks: Interactive development
VS Code: Full-featured IDE
PyCharm: Professional Python IDE
Spyder: Scientific Python IDE
Learning Resources
Documentation: python.org/doc
Tutorials: Real Python, Python.org
Practice: Kaggle, HackerRank
Books: "Python for Data Analysis" by McKinney
Python Basics for Research
Variables and Data Types
```python
# Basic data types
name = "John Doe"   # String
age = 25            # Integer
gpa = 3.85          # Float
is_student = True   # Boolean

# Lists and dictionaries
scores = [85, 92, 78, 96]  # List
student = {                # Dictionary
    "name": "John",
    "age": 25,
    "gpa": 3.85,
}
```
Control Structures
```python
# If statements
if gpa >= 3.5:
    print("Dean's List")
elif gpa >= 3.0:
    print("Good Standing")
else:
    print("Needs Improvement")

# For loop
for score in scores:
    print(f"Score: {score}")

# While loop
count = 0
while count < 5:
    print(count)
    count += 1
```
Functions
```python
# Function definition
def calculate_average(numbers):
    """Calculate the average of a list of numbers."""
    return sum(numbers) / len(numbers)

# Function with a default parameter
def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"

# Using functions
avg_score = calculate_average(scores)
message = greet("Alice", "Hi")
```
File Operations
```python
# Reading files
with open('data.txt', 'r') as file:
    content = file.read()

# Writing files
with open('results.txt', 'w') as file:
    file.write("Analysis Results\n")
    file.write(f"Average: {avg_score}")

# Working with CSV
import csv

with open('data.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
```
Data Analysis with Pandas
Creating DataFrames
```python
import pandas as pd

# From a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [85, 92, 78],
}
df = pd.DataFrame(data)

# From a CSV file
df = pd.read_csv('data.csv')

# From an Excel file (requires the openpyxl package)
df = pd.read_excel('data.xlsx')
```
Data Exploration
```python
# Basic info
print(df.head())      # First 5 rows
print(df.tail())      # Last 5 rows
print(df.info())      # Data types and info
print(df.describe())  # Summary statistics

# Shape and columns
print(df.shape)    # (rows, columns)
print(df.columns)  # Column names
print(df.dtypes)   # Data types
```
Data Manipulation
```python
# Selecting data
df['Name']              # Single column
df[['Name', 'Age']]     # Multiple columns
df.loc[0]               # Row by index label
df.loc[df['Age'] > 30]  # Conditional selection

# Adding columns
df['Grade'] = df['Score'] / 10
df['Pass'] = df['Score'] >= 80

# Grouping and aggregation
df.groupby('Pass')['Score'].mean()
df.groupby('Grade').size()
```
Data Cleaning
```python
# Missing values
df.isnull().sum()  # Count missing values per column
df.dropna()        # Remove rows with NaN
df.fillna(0)       # Fill NaN with 0

# Duplicates
df.duplicated().sum()  # Count duplicate rows
df.drop_duplicates()   # Remove duplicates

# Note: dropna, fillna, and drop_duplicates return new DataFrames;
# assign the result (e.g. df = df.dropna()) to keep the change.

# Data types
df['Age'] = df['Age'].astype(int)
df['Score'] = pd.to_numeric(df['Score'])
```
Data Visualization
Matplotlib Basics
```python
import matplotlib.pyplot as plt

# Basic line plot
plt.figure(figsize=(10, 6))
plt.plot(df['Age'], df['Score'])
plt.xlabel('Age')
plt.ylabel('Score')
plt.title('Age vs Score')
plt.show()

# Scatter plot
plt.scatter(df['Age'], df['Score'])
plt.xlabel('Age')
plt.ylabel('Score')
plt.show()
```
Seaborn Visualizations
```python
import seaborn as sns

# Distribution plot
sns.histplot(df['Score'])
plt.show()

# Box plot
sns.boxplot(x='Pass', y='Score', data=df)
plt.show()

# Correlation heatmap (numeric columns only; passing text
# columns such as 'Name' to corr() raises an error)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True)
plt.show()
```
Statistical Analysis
```python
from scipy import stats

# Descriptive statistics
mean_score = df['Score'].mean()
median_score = df['Score'].median()
std_score = df['Score'].std()

# One-sample t-test against a hypothesized mean of 80
t_stat, p_value = stats.ttest_1samp(df['Score'], 80)

# Pearson correlation
correlation, p_val = stats.pearsonr(df['Age'], df['Score'])

print(f"Mean: {mean_score:.2f}")
print(f"Correlation: {correlation:.3f}")
```
Export Results
```python
# Save DataFrame
df.to_csv('results.csv', index=False)
df.to_excel('results.xlsx', index=False)

# Save plot
plt.figure(figsize=(10, 6))
plt.plot(df['Age'], df['Score'])
plt.savefig('age_score_plot.png', dpi=300)
plt.show()

# Generate a summary report
summary = df.describe()
summary.to_csv('summary_stats.csv')
```
Research Workflow Tips
Project Organization
- Create separate folders for data, scripts, and results
- Use descriptive file names
- Keep raw data separate from processed data
- Document your code with comments
- Use version control (Git)
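The folder layout above can be set up in a few lines. This is a minimal sketch; the folder names are only a suggestion, not a required convention:

```python
from pathlib import Path

# Example project layout: raw data, processed data, scripts, results
folders = ["data/raw", "data/processed", "scripts", "results"]

for folder in folders:
    # parents=True creates intermediate folders; exist_ok=True makes reruns safe
    Path(folder).mkdir(parents=True, exist_ok=True)
```

Running this once at the start of a project keeps every analysis file in a predictable place.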
Code Best Practices
- Write readable, well-commented code
- Use meaningful variable names
- Break complex operations into functions
- Test your code with sample data
- Handle errors gracefully
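Several of these practices can be shown in one small function: a meaningful name, a docstring, and an explicit error instead of a cryptic failure. A minimal sketch:

```python
def calculate_average(numbers):
    """Return the mean of a non-empty list of numbers."""
    if not numbers:
        # Fail fast with a clear message instead of a ZeroDivisionError
        raise ValueError("numbers must not be empty")
    return sum(numbers) / len(numbers)

# Handle the error gracefully at the call site
try:
    avg = calculate_average([85, 92, 78, 96])
except ValueError as err:
    print(f"Bad input: {err}")
```

Testing the function with sample data (including the empty-list edge case) catches bugs before they reach real analyses.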
Reproducibility
- Set random seeds for reproducible results
- Document your environment (requirements.txt)
- Keep detailed analysis logs
- Share code and data when possible
- Use Jupyter notebooks for exploration
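Seeding in practice looks like the sketch below; 42 is an arbitrary choice, any fixed value works:

```python
import random

import numpy as np

SEED = 42  # arbitrary fixed value; document it with your analysis
random.seed(SEED)
np.random.seed(SEED)

# Every run of the script now draws the same "random" sample
sample = np.random.normal(loc=80, scale=10, size=5)
```

Pair this with a `requirements.txt` (e.g. from `pip freeze`) so collaborators can recreate both the environment and the random draws.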
Common Pitfalls
- Not backing up your work
- Overwriting original data
- Not documenting analysis steps
- Ignoring data quality issues
- Not validating results
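The "overwriting original data" pitfall is easy to avoid by always working on a copy. A minimal sketch, with hypothetical file names (the `write_text` line stands in for real collected data):

```python
import shutil
from pathlib import Path

raw = Path("data/raw/survey.csv")                 # hypothetical raw file
work = Path("data/processed/survey_working.csv")  # all edits happen here

raw.parent.mkdir(parents=True, exist_ok=True)
raw.write_text("name,score\nAlice,85\n")          # stand-in for real raw data

work.parent.mkdir(parents=True, exist_ok=True)
shutil.copy2(raw, work)                           # analyze the copy, never the original
```

With this habit, a bad cleaning step costs you a regenerated copy, not your only record of the data.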