Heh heh heh.
Chapter 1, 'Creating Features'.
Selecting specific data types
# Create subset of only the numeric columns
so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])
# Print the column names contained in so_numeric_df
print(so_numeric_df.columns)
Dealing with categorical features
- One-hot encoding
  - n features for n categories
  - Explainable
- Dummy encoding
  - n-1 features for n categories
  - Encodes the necessary information without a redundant column
# One hot
pd.get_dummies(df, columns=['Country'], prefix='C')
# Dummy
pd.get_dummies(df, columns=['Country'], drop_first=True, prefix='C')
Limiting columns
# Group countries that appear fewer than 10 times into 'Other'
# Create a copy of the Country column as a Series (copy avoids a SettingWithCopyWarning when relabelling below)
countries = so_survey_df['Country'].copy()
# Get the counts of each category
country_counts = countries.value_counts()
# Create a mask for only categories that occur less than 10 times
mask = countries.isin(country_counts[country_counts < 10].index)
# Relabel these rare categories as 'Other'
countries[mask] = 'Other'
# Print the updated category counts
print(countries.value_counts())
Numeric variables
- Binarizing columns (see the sketch below)
- Binning numeric variables
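A minimal binarizing sketch; the Paid_Job column name is my own, not from these notes:
# Create a 0/1 indicator for whether a respondent reported any salary (Paid_Job is a hypothetical column name)
so_survey_df['Paid_Job'] = 0
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1
print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())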
# Equal-width binning with pd.cut
so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], 5)
# Binning with user-specified boundaries
# Import numpy
import numpy as np
# Specify the boundaries of the bins
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']
# Bin the continuous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'],
bins=bins, labels=labels)
# Print the first 5 rows of the boundary_binned and ConvertedSalary columns
print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())
Missing data
Listwise deletion
# Drop rows with at least one missing value
df.dropna(how='any')
# Drop rows with missing values in specific columns
df.dropna(subset=['col_name'])  # 'col_name' is a placeholder column name
Fillna
# Fill missing values in a column ('cl' and 'xxx' are placeholders)
df['cl'] = df['cl'].fillna('xxx')
Fill continuous missing values
- Mean
- Median (see the sketch below)
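A minimal sketch of filling with the column mean, reusing ConvertedSalary from the examples above; the median call is analogous:
# Fill missing values with the column mean
so_survey_df['ConvertedSalary'] = so_survey_df['ConvertedSalary'].fillna(so_survey_df['ConvertedSalary'].mean())
# Median is more robust to outliers:
# so_survey_df['ConvertedSalary'] = so_survey_df['ConvertedSalary'].fillna(so_survey_df['ConvertedSalary'].median())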
Dealing with other data issues
Bad characters: a numeric column may contain NaN or stray non-numeric characters. Remove the stray characters, convert the data type, and use isna() to locate any values that still fail to parse.
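A sketch assuming a string column RawSalary containing commas and '$' signs; the column name and the characters removed are illustrative, not taken from these notes:
import pandas as pd
# Remove commas and dollar signs (RawSalary is a hypothetical column)
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace(',', '', regex=False)
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace('$', '', regex=False)
# Coerce to numeric; anything that still can't be parsed becomes NaN
coerced = pd.to_numeric(so_survey_df['RawSalary'], errors='coerce')
# Use isna() on the coerced values to find rows with remaining stray characters
print(so_survey_df['RawSalary'][coerced.isna()])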
Data distributions
# Import matplotlib for plotting
import matplotlib.pyplot as plt
# Create a boxplot of two columns
so_numeric_df[['Age', 'Years Experience']].boxplot()
plt.show()
Scaling and transformations
Use a log transform for long-tailed data.
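A minimal sketch using numpy's log1p (my choice here; it handles zero salaries). The ConvertedSalary_log column name is my own:
# Log-transform the long-tailed salary column
import numpy as np
import matplotlib.pyplot as plt
so_numeric_df['ConvertedSalary_log'] = np.log1p(so_numeric_df['ConvertedSalary'])
# Compare the distributions before and after the transform
so_numeric_df[['ConvertedSalary', 'ConvertedSalary_log']].hist()
plt.show()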
Removing outliers
Use quantiles, e.g. trim values above the 95th percentile.
# Find the 95th quantile
quantile = so_numeric_df['ConvertedSalary'].quantile(0.95)
# Trim the outliers
trimmed_df = so_numeric_df[so_numeric_df['ConvertedSalary'] < quantile]
Scaling and transforming new data
Fit scalers and transformers on the training data only, then apply them to the test data; fitting on the test data causes data leakage, as sketched below.
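A sketch of the train/test pattern; train_df and test_df are placeholder names for the two splits:
# Fit the scaler on the training data only
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_df[['ConvertedSalary']] = scaler.fit_transform(train_df[['ConvertedSalary']])
# Re-use the already-fitted scaler on the test data; never call fit on the test set
test_df[['ConvertedSalary']] = scaler.transform(test_df[['ConvertedSalary']])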
Encoding text
Cleaning text
# Replace all non letter characters with a whitespace
speech_df['text_clean'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)
# Change to lower case
speech_df['text_clean'] = speech_df['text_clean'].str.lower()
# Print the first 5 rows of the text_clean column
print(speech_df['text_clean'].head())
High-level features
# Find the length of each text
speech_df['char_cnt'] = speech_df['text_clean'].str.len()
# Count the number of words in each text
speech_df['word_cnt'] = speech_df['text_clean'].str.split().str.len()
# Find the average length of word
speech_df['avg_word_length'] = speech_df.char_cnt / speech_df.word_cnt
# Print the first 5 rows of these columns
print(speech_df[['text_clean', 'char_cnt', 'word_cnt', 'avg_word_length']].head())
Word count
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Instantiate CountVectorizer
cv = CountVectorizer()
# Fit the vectorizer
cv.fit(speech_df['text_clean'])
# Print the feature names
print(cv.get_feature_names_out())
Term frequency-inverse document frequency
I don't really feel like going through this part any more... things you don't use get forgotten.
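Still, a minimal TfidfVectorizer sketch for later reference; the max_features and stop_words settings are illustrative choices, not from these notes:
import pandas as pd
# Import and instantiate a TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')
# Fit and transform the cleaned text, then wrap the weights in a DataFrame
tv_transformed = tv.fit_transform(speech_df['text_clean'])
tv_df = pd.DataFrame(tv_transformed.toarray(), columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
print(tv_df.head())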