Heh heh heh.
Chapter 1, 'Creating Features'.
Selecting specific data types
# Create subset of only the numeric columns
so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])
# Print the column names contained in so_numeric_df
print(so_numeric_df.columns)
Dealing with categorical features
- One-hot encoding
  - n features for n categories
  - Explainable
- Dummy encoding
  - n-1 features for n categories
  - Encodes the necessary information without a redundant column
# One hot
pd.get_dummies(df, columns=['Country'], prefix='C')
# Dummy
pd.get_dummies(df, columns=['Country'], drop_first=True, prefix='C')
Limiting columns
# Group countries that appear fewer than 10 times into 'Other'
# Create a copy of the Country column as a Series (copy avoids a SettingWithCopyWarning when relabelling below)
countries = so_survey_df['Country'].copy()
# Get the counts of each category
country_counts = countries.value_counts()
# Create a mask for only categories that occur less than 10 times
mask = countries.isin(country_counts[country_counts < 10].index)
# Relabel these rare categories as 'Other'
countries[mask] = 'Other'
# Print the updated category counts
print(countries.value_counts())
Numeric variables
- Binarizing columns (see the sketch below)
- Binning numeric variables
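A minimal binarizing sketch; the Paid_Job column name is my own, not from these notes:
# Create a 0/1 indicator for whether a respondent reported any salary (Paid_Job is a hypothetical column name)
so_survey_df['Paid_Job'] = 0
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1
print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())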
# Equal-width binning with pd.cut
so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], 5)
# Binning with user-specified boundaries
# Import numpy
import numpy as np
# Specify the boundaries of the bins
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']
# Bin the continuous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'],
bins=bins, labels=labels)
# Print the first 5 rows of the boundary_binned and ConvertedSalary columns
print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())
Missing data
Listwise deletion
# Drop rows with at least one missing value
df.dropna(how='any')
# Drop rows with missing values in specific columns
df.dropna(subset=['col_name'])  # 'col_name' is a placeholder column name
Fillna
# Fill missing values in a column ('cl' and 'xxx' are placeholders)
df['cl'] = df['cl'].fillna('xxx')
Fill continuous missing values
- Mean
- Median (see the sketch below)
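A minimal sketch of filling with the column mean, reusing ConvertedSalary from the examples above; the median call is analogous:
# Fill missing values with the column mean
so_survey_df['ConvertedSalary'] = so_survey_df['ConvertedSalary'].fillna(so_survey_df['ConvertedSalary'].mean())
# Median is more robust to outliers:
# so_survey_df['ConvertedSalary'] = so_survey_df['ConvertedSalary'].fillna(so_survey_df['ConvertedSalary'].median())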
Dealing with other data issues
Bad characters: a numeric column may contain NaN or stray non-numeric characters. Remove the stray characters, convert the data type, and use isna() to locate any values that still fail to parse.
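A sketch assuming a string column RawSalary containing commas and '$' signs; the column name and the characters removed are illustrative, not taken from these notes:
import pandas as pd
# Remove commas and dollar signs (RawSalary is a hypothetical column)
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace(',', '', regex=False)
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace('$', '', regex=False)
# Coerce to numeric; anything that still can't be parsed becomes NaN
coerced = pd.to_numeric(so_survey_df['RawSalary'], errors='coerce')
# Use isna() on the coerced values to find rows with remaining stray characters
print(so_survey_df['RawSalary'][coerced.isna()])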
Data distributions
# Import matplotlib for plotting
import matplotlib.pyplot as plt
# Create a boxplot of two columns
so_numeric_df[['Age', 'Years Experience']].boxplot()
plt.show()
Scaling and transformations
Use a log transform for long-tailed data.
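A minimal sketch using numpy's log1p (my choice here; it handles zero salaries). The ConvertedSalary_log column name is my own:
# Log-transform the long-tailed salary column
import numpy as np
import matplotlib.pyplot as plt
so_numeric_df['ConvertedSalary_log'] = np.log1p(so_numeric_df['ConvertedSalary'])
# Compare the distributions before and after the transform
so_numeric_df[['ConvertedSalary', 'ConvertedSalary_log']].hist()
plt.show()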
Removing outliers
Use quantiles, e.g. trim values above the 95th percentile.
# Find the 95th quantile
quantile = so_numeric_df['ConvertedSalary'].quantile(0.95)
# Trim the outliers
trimmed_df = so_numeric_df[so_numeric_df['ConvertedSalary'] < quantile]
Scaling and transforming new data
Fit scalers and transformers on the training data only, then apply them to the test data; fitting on the test data causes data leakage, as sketched below.
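A sketch of the train/test pattern; train_df and test_df are placeholder names for the two splits:
# Fit the scaler on the training data only
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_df[['ConvertedSalary']] = scaler.fit_transform(train_df[['ConvertedSalary']])
# Re-use the already-fitted scaler on the test data; never call fit on the test set
test_df[['ConvertedSalary']] = scaler.transform(test_df[['ConvertedSalary']])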
Encoding text
Cleaning text
# Replace all non letter characters with a whitespace
speech_df['text_clean'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)
# Change to lower case
speech_df['text_clean'] = speech_df['text_clean'].str.lower()
# Print the first 5 rows of the text_clean column
print(speech_df['text_clean'].head())
High-level features
# Find the length of each text
speech_df['char_cnt'] = speech_df['text_clean'].str.len()
# Count the number of words in each text
speech_df['word_cnt'] = speech_df['text_clean'].str.split().str.len()
# Find the average length of word
speech_df['avg_word_length'] = speech_df.char_cnt / speech_df.word_cnt
# Print the first 5 rows of these columns
print(speech_df[['text_clean', 'char_cnt', 'word_cnt', 'avg_word_length']].head())
Word count
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Instantiate CountVectorizer
cv = CountVectorizer()
# Fit the vectorizer
cv.fit(speech_df['text_clean'])
# Print the feature names
print(cv.get_feature_names_out())
Term frequency-inverse document frequency
I don't really feel like going through this part any more... things you don't use get forgotten.
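Still, a minimal TfidfVectorizer sketch for later reference; the max_features and stop_words settings are illustrative choices, not from these notes:
import pandas as pd
# Import and instantiate a TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')
# Fit and transform the cleaned text, then wrap the weights in a DataFrame
tv_transformed = tv.fit_transform(speech_df['text_clean'])
tv_df = pd.DataFrame(tv_transformed.toarray(), columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
print(tv_df.head())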