# scikit-learn Cheatsheet scikit-learn is a popular open-source machine learning library for Python. It provides a wide range of tools for building and evaluating machine learning models, including classification, regression, clustering, and more. This cheatsheet provides a quick reference for some of scikit-learn's unique features, including code blocks for loading data, preprocessing, model selection, and more. Additionally, it includes a list of resources for further learning. ## Loading Data ```python from sklearn.datasets import load_digits # Load the digits dataset digits = load_digits() # Split the data into training and testing sets from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=42) ``` ## Preprocessing ```python # Scale the data to have zero mean and unit variance from sklearn.preprocessing import StandardScaler scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) # Encode categorical variables as integers from sklearn.preprocessing import LabelEncoder encoder = LabelEncoder() y_train_encoded = encoder.fit_transform(y_train) y_test_encoded = encoder.transform(y_test) ``` ## Model Selection ```python # Train a support vector machine classifier from sklearn.svm import SVC clf = SVC(kernel='linear', C=1.0, random_state=42) clf.fit(X_train_scaled, y_train_encoded) # Evaluate the classifier from sklearn.metrics import accuracy_score y_pred = clf.predict(X_test_scaled) accuracy_score(y_test_encoded, y_pred) ``` ## Cross-Validation ```python # Perform k-fold cross-validation from sklearn.model_selection import cross_val_score scores = cross_val_score(clf, X_train_scaled, y_train_encoded, cv=5) ``` ## Grid Search ```python # Perform a grid search over hyperparameters from sklearn.model_selection import GridSearchCV param_grid = {'C': [0.1, 1.0, 10.0], 'kernel': ['linear', 'rbf']} grid_search = GridSearchCV(clf, param_grid, cv=5) grid_search.fit(X_train_scaled, y_train_encoded) ``` ## Other Useful Features ```python # Train a decision tree classifier from sklearn.tree import DecisionTreeClassifier clf = DecisionTreeClassifier(max_depth=3, random_state=42) clf.fit(X_train_scaled, y_train_encoded) # Visualize the decision tree from sklearn.tree import plot_tree plot_tree(clf) # Save and load a model import joblib joblib.dump(clf, 'model.joblib') clf = joblib.load('model.joblib') ``` ## Resources - [scikit-learn documentation](https://scikit-learn.org/stable/documentation.html) - [scikit-learn tutorials](https://scikit-learn.org/stable/tutorial/index.html) - [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)