1. What is Scikit-learn?
Ans: Scikit-learn is a popular Python machine-learning library that provides efficient tools for data analysis and modeling. It includes a wide range of algorithms for classification, regression, clustering, dimensionality reduction, and model selection.
2. What are the key features of Scikit-learn?
Ans: Key features of Scikit-learn include:
- Simple and consistent API: Scikit-learn provides a uniform interface for various machine learning algorithms.
- Wide range of algorithms: It includes a comprehensive set of algorithms for various tasks, such as classification, regression, clustering, and more.
- Integration with NumPy and SciPy: Scikit-learn seamlessly integrates with other scientific computing libraries in Python.
- Data preprocessing: It provides tools for data preprocessing, such as scaling, encoding categorical variables, and handling missing values.
- Model evaluation: Scikit-learn offers methods for evaluating and comparing models using various metrics and cross-validation techniques.
3. How can you install Scikit-learn?
Ans: Scikit-learn can be installed using pip, which is a package installer for Python. You can run the following command to install Scikit-learn:
pip install scikit-learn
4. What are the main modules in Scikit-learn?
Ans: Scikit-learn consists of several modules, but some of the main ones are:
- sklearn.datasets: Provides various datasets for practice and experimentation.
- sklearn.preprocessing: Contains tools for data preprocessing, such as scaling and encoding (imputation of missing values lives in sklearn.impute).
- sklearn.model_selection: Provides tools for model selection, including cross-validation and hyperparameter tuning.
- sklearn.linear_model: Implements linear models for regression and classification tasks.
- sklearn.tree: Includes classes for decision tree-based models.
- sklearn.ensemble: Contains ensemble methods, such as random forests and gradient boosting.
- sklearn.cluster: Implements clustering algorithms, such as k-means and DBSCAN.
- sklearn.metrics: Provides a wide range of evaluation metrics for assessing model performance.
5. How can you handle missing values in Scikit-learn?
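Ans: There are a few ways to handle missing values when working with Scikit-learn:
- Removing instances: You can drop rows with missing values before modeling, for example with the pandas dropna() method (this is a pandas function, not part of Scikit-learn).
- Imputation: Scikit-learn provides the SimpleImputer class (in sklearn.impute), which can fill missing values with the mean, median, most frequent value, or a constant.
- Native handling: Some estimators can handle missing values directly, such as HistGradientBoostingClassifier and HistGradientBoostingRegressor; decision trees also accept missing values in recent Scikit-learn releases.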
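The imputation option can be sketched as follows. This is a minimal example on made-up data; the array values and the choice of the "mean" strategy are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix (made-up values) with missing entries marked as np.nan.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

# Column 0 mean is 2.0 and column 1 mean is 15.0, so those fill the gaps.
print(X_filled)
```

Other strategies accepted by SimpleImputer include "median", "most_frequent", and "constant".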
6. How can you scale features in Scikit-learn?
Ans: Scikit-learn provides the StandardScaler class for feature scaling. It scales the features to have zero mean and unit variance. You can use it as follows:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
7. What is cross-validation in Scikit-learn?
Ans: Cross-validation is a technique used to assess the performance of a machine-learning model on unseen data. Scikit-learn provides the cross_val_score function, which performs cross-validation by splitting the data into multiple folds, training the model on some folds, and evaluating it on the remaining fold. It returns an array of scores for each fold, which can be used to estimate the model’s performance.
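A minimal sketch of cross_val_score in use. The dataset (iris), the estimator (LogisticRegression), and the choice of five folds are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 splits the data into five folds; each fold is held out once for evaluation.
scores = cross_val_score(model, X, y, cv=5)

# One score per fold; the mean and spread summarize the estimated performance.
print(scores)
print(scores.mean(), scores.std())
```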
8. What is regularization?
Ans: Regularization is a technique for preventing overfitting in machine learning models by adding a penalty term to the loss function.
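As one possible illustration, Ridge regression adds an L2 penalty, alpha times the squared norm of the coefficients, to the least-squares loss. The sketch below uses synthetic data and two arbitrary alpha values (both assumptions) to show that a stronger penalty shrinks the coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression problem (illustrative only).
X, y = make_regression(n_samples=50, n_features=10, noise=10.0, random_state=0)

# alpha controls the strength of the L2 penalty on the coefficients.
weak = Ridge(alpha=0.1).fit(X, y)
strong = Ridge(alpha=100.0).fit(X, y)

weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)

# The heavily penalized model has the smaller coefficient norm.
print(weak_norm, strong_norm)
```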
9. What is a decision tree?
Ans: A decision tree is a supervised learning algorithm that is used for classification and regression tasks. It works by recursively splitting the data into subsets based on the values of the input features.
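The recursive splitting can be inspected directly. The sketch below fits a shallow tree on the iris dataset (an assumption chosen for illustration) and prints the learned split rules with export_text.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth=2 keeps the tree small enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a split on one feature's value.
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```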
10. What is a random forest?
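Ans: A random forest is an ensemble learning algorithm that consists of multiple decision trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split. It combines the predictions of the individual trees (averaging for regression, voting or averaging class probabilities for classification) to improve accuracy and reduce overfitting.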
11. What is gradient boosting?
Ans: Gradient boosting is an ensemble learning algorithm that works by iteratively adding new models to the ensemble, each one correcting the errors of the previous model.
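A rough side-by-side of the two ensemble methods just described, random forests and gradient boosting. The dataset, the 100 estimators, and the train/test split are assumptions for illustration; the scores are not a benchmark.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random forest: independent trees whose predictions are combined.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Gradient boosting: trees added one at a time, each fitted to the
# remaining errors of the ensemble built so far.
boost = GradientBoostingClassifier(n_estimators=100, random_state=0)
boost.fit(X_train, y_train)

forest_acc = forest.score(X_test, y_test)
boost_acc = boost.score(X_test, y_test)
print(forest_acc, boost_acc)
```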
12. What is clustering?
Ans: Clustering is an unsupervised learning technique that involves grouping similar data points together based on their features.
13. What is K-means clustering?
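Ans: K-means clustering is a popular clustering algorithm that divides the data into K clusters, where K is a user-defined parameter. It alternates between assigning each point to the nearest cluster centroid and recomputing each centroid as the mean of its assigned points, until the assignments stabilize.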
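A small sketch with Scikit-learn's KMeans class. The six points are made up to form two obvious groups, and K=2 is chosen by hand.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of points (made-up data).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# n_clusters is K, the number of clusters chosen by the user.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_            # cluster index assigned to each point
centers = kmeans.cluster_centers_  # one centroid per cluster
print(labels)
print(centers)
```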
14. What is Principal Component Analysis (PCA)?
Ans: PCA is a dimensionality reduction technique that involves finding the principal components of the data, which are the linear combinations of the original features that explain the most variance in the data.
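A minimal sketch with Scikit-learn's PCA class, reducing the four iris features to two components. The dataset and the number of components are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Keep the two directions that explain the most variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance explained by each kept component.
ratios = pca.explained_variance_ratio_
print(X_2d.shape)
print(ratios)
```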
15. What are Support Vector Machines (SVM)?
Ans: SVM is a supervised learning algorithm used for classification and regression tasks. For classification, it works by finding the hyperplane that separates the classes with the largest margin; kernel functions allow it to learn non-linear boundaries as well.
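A short sketch of a linear SVM classifier using SVC. The synthetic two-class data and the value C=1.0 are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic clusters, one per class (illustrative data).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.6, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# For a linear kernel, coef_ and intercept_ define the separating hyperplane.
print(clf.coef_, clf.intercept_)

# Only the support vectors, the points nearest the boundary, determine it.
n_support = len(clf.support_vectors_)
train_acc = clf.score(X, y)
print(n_support, train_acc)
```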
16. What is Grid Search?
Ans: Grid Search is a technique for hyperparameter tuning in machine learning models. It involves searching over a range of hyperparameters to find the best combination that maximizes the model’s performance.
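In Scikit-learn this is available as GridSearchCV. The sketch below tries every combination in a small grid; the estimator and the grid values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 3 values of C times 2 kernels gives 6 candidate combinations.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Each candidate is scored with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, search.best_estimator_ holds a model refitted on the full data with the best parameters.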
17. What is a ROC curve?
Ans: A ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of a binary classifier, plotting the true positive rate against the false positive rate for different threshold values.
18. What is AUC?
Ans: AUC stands for Area Under the Curve and is a metric used to evaluate the performance of a binary classifier. It represents the area under the ROC curve and ranges from 0 to 1, with higher values indicating better performance; a value of 0.5 corresponds to random guessing.
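Both the ROC curve and the AUC can be computed with sklearn.metrics. The labels and scores below are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# True labels and predicted scores for four samples (made-up values).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) point per threshold traces out the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# 3 of the 4 positive/negative pairs are ranked correctly, so AUC = 0.75.
auc = roc_auc_score(y_true, y_score)
print(fpr, tpr)
print(auc)
```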
19. What is feature scaling?
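Ans: Feature scaling is a preprocessing step in machine learning that brings the input features onto a common scale, for example by rescaling each feature to the range 0 to 1 (min-max scaling) or to zero mean and unit variance (standardization). It improves the performance of algorithms that are sensitive to feature magnitudes, such as distance-based and gradient-based methods.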
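Two common scalers side by side, as a minimal sketch on a made-up single-feature array: MinMaxScaler rescales to the range 0 to 1, while StandardScaler rescales to zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A single feature with values 1 to 5 (made-up data).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Min-max scaling maps the smallest value to 0 and the largest to 1.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization subtracts the mean and divides by the standard deviation.
X_std = StandardScaler().fit_transform(X)

print(X_minmax.ravel())  # 0, 0.25, 0.5, 0.75, 1
print(X_std.ravel())
```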
20. What is a pipeline in Scikit-learn?
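Ans: A pipeline in Scikit-learn chains multiple processing steps together: a sequence of transformers (such as scalers or encoders) followed by a final estimator. The output of each step is passed as the input to the next, so the whole chain can be fitted, cross-validated, and used for prediction as a single estimator.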
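A minimal sketch of a two-step pipeline. The step names "scaler" and "clf", the dataset, and the choice of estimator are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is a (name, object) pair; data flows through them in order.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() fits the scaler on the training data, transforms it, then fits the model.
pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler to the test data before predicting.
accuracy = pipe.score(X_test, y_test)
print(accuracy)
```

Because the scaler is fitted only on the training data, the pipeline avoids leaking test-set statistics into preprocessing.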
21. When would you use StratifiedKFold instead of KFold?
Ans: KFold is a cross-validator that divides the dataset into k folds. With shuffle=False (the default), the folds are consecutive blocks of the dataset in its original order; with shuffle=True, the samples are shuffled before being split.
StratifiedKFold takes the cross-validation one step further. The class distribution in the dataset is preserved in the training and test splits, i.e. each fold of the dataset has the same proportion of observations with a given label.
Therefore, we should prefer StratifiedKFold over KFold when dealing with classification tasks with imbalanced class distributions.
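The difference shows up clearly on an imbalanced toy dataset. In this sketch the 16-to-4 class split and the four folds are assumptions chosen so the counts are easy to check by hand.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 20 samples: 16 of class 0 followed by 4 of class 1 (made-up, imbalanced).
X = np.zeros((20, 1))
y = np.array([0] * 16 + [1] * 4)

# Count the classes in each test fold for both splitters.
strat_counts = [np.bincount(y[test_idx], minlength=2)
                for _, test_idx in StratifiedKFold(n_splits=4).split(X, y)]
plain_counts = [np.bincount(y[test_idx], minlength=2)
                for _, test_idx in KFold(n_splits=4).split(X)]

# StratifiedKFold: every test fold has 4 samples of class 0 and 1 of class 1.
print(strat_counts)

# KFold without shuffling: the first three folds contain no class 1 at all.
print(plain_counts)
```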
22. How do you predict time series in Scikit-learn?
Ans: A time series is a sequence of data points collected at constant time intervals. Time-series prediction rests on the assumption that current values depend, more or less, on past ones. Many forecasting methods work best when the mean and variance of the series are stable over time. To stabilize the variance, you can apply a logarithm transformation to the data; to remove a trend in the mean, you can difference the series (subtract each value from the one that follows it).
For time-series prediction, RNN and LSTM models (deep learning) are widely used, but Scikit-learn does not provide built-in implementations of them. For those models you are better off with TensorFlow or PyTorch, which are common frameworks for building RNNs and LSTMs.
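The log and differencing transformations mentioned above can be sketched with NumPy. As a further assumption that goes beyond the answer, the example also shows one common way to use an ordinary Scikit-learn regressor on a series: build lag features by hand and evaluate with TimeSeriesSplit, which keeps every training fold earlier in time than its test fold. The synthetic series, the three lags, and the linear model are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Synthetic series with exponential growth and mild noise (made-up data).
rng = np.random.default_rng(0)
t = np.arange(1, 101)
series = np.exp(0.05 * t + rng.normal(scale=0.05, size=t.size))

log_series = np.log(series)   # log transform stabilizes the variability
diffed = np.diff(log_series)  # differencing removes the trend

# Lag features: predict each value from the three values before it.
n_lags = 3
X = np.column_stack([log_series[i:len(log_series) - n_lags + i]
                     for i in range(n_lags)])
y = log_series[n_lags:]

# Each split trains on the past and tests on the block that follows it.
folds = list(TimeSeriesSplit(n_splits=4).split(X))
errors = []
for train_idx, test_idx in folds:
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(np.mean(np.abs(model.predict(X[test_idx]) - y[test_idx])))

print(diffed.mean())  # close to the growth rate of 0.05
print(errors)
```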
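23. Can you use a custom distance function with Scikit-learn's K-Means clustering?
Ans: Unfortunately no: by definition, the k-means clustering algorithm relies on the Euclidean distance from the mean of each cluster. Scikit-learn's KMeans has no metric parameter, and it is not trivial to extend k-means to other distances. One workaround is to transform the data so that Euclidean distance in the transformed space matches the distance you want; for example, whitening the data with its covariance makes Euclidean distance equivalent to the Mahalanobis distance. Alternatively, you can use a clustering algorithm that accepts a metric, such as DBSCAN.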
24. What does the “fit()” method in scikit-learn do?
Ans: Fitting your model to the training data with the fit() method is essentially the training part of the modeling process. The fit() method learns the model's parameters from the data, for example the coefficients of a linear model. It modifies the estimator in place and returns a reference to that same object, which allows calls to be chained. Once trained, the model can be used to make predictions, usually with a predict() method call.
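A small sketch of this behaviour with LinearRegression. The data is made up so that it follows y = 2x + 1 exactly, which makes the learned parameters easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data that follows y = 2x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
returned = model.fit(X, y)

# fit() returns the estimator itself, now holding the learned parameters.
print(returned is model)
print(model.coef_, model.intercept_)  # approximately [2.] and 1.0

# The fitted model can now make predictions on new inputs.
prediction = model.predict(np.array([[5.0]]))
print(prediction)  # approximately [11.]
```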
25. What is a decision boundary?
Ans: A decision boundary is a boundary that separates the regions of feature space that belong to different classes in a classification problem.