Over the Hedge

A Quick Business Discussion


Some of you might know that my fiancée is a small business owner on top of the many hats that she wears. Today’s post is about a small part of that experience that I’m familiar with. Read along to learn something about how basic spam e-mail classification models are built and how you too can make a startup that replaces Google, get acquired, then get shut down after securing your bag.

If you’ve ever worked with a small business that doesn’t have Gmail set up to host its e-mail services, you’ll know the struggle of hundreds of e-mails from totally anonymous people fishing for who knows what.

If you haven’t had this experience, imagine that you’re really excited to receive client inquiries for custom dresses and sizing. Unfortunately, 25% of the e-mails you get are complete spam. These solicitors tend to be completely harmless, but boy are they annoying. Here’s an example:

from: SHOPIFY EXPERT ahmedi*************@gmail.com[1]

subject:

content: If I could help your store achieve 8–10 consistent revenue daily without relying heavily on paid ads, would you be available to a brief walkthrough of the exact system that’s already generating results for other stores?

At first read, 8-10 incremental daily sales without relying on paid ads would be amazing! How much would you normally pay for that? And this guy would do this for us with no guarantees?! Amazing, sign me up. I don’t know what game they’re actually playing because I’ve never responded, but it reeks of SCAM.

Buzz and Woody find Scams Everywhere

Gmail tends to do a pretty decent job of filtering out real spam. Unfortunately, the solicitors tend to also format their e-mails in ways that don’t look entirely like spam and might even be a reasonable outreach attempt (if I didn’t know better). Gmail is also EXPENSIVE. Now you might say, “but Varun, Gmail is so cheap, it’s only $20/month,” but when you’re bootstrapping you make little sacrifices on quality of life issues[2] because literally every dollar counts and can be used effectively for other things. For example, properly tuned outreach ads can cost $0.01 to $0.03 per view.

As the good little fiancé I am, I decided to use basic logic statistics classification MACHINE LEARNING ✨AI✨ to filter out spam intelligently.


the process

People explore and process data in so many different ways. I always find it helpful to understand how people planned their research before they start. Here’s mine:

  1. Define the problem. Identify possible solutions
  2. Find and archive the relevant data
  3. Explore data to see if any patterns or features stick out
  4. Decide on a model
  5. Train the model
  6. Test the model
  7. Time for production, baby

some disclaimers

Okay, now onto the actual work!

defining the problem and solution

Spam is annoying. Every spam e-mail wastes brainspace and bothers us. It distracts from doing real productive things. To deal with the spam, we can either ignore all notifications or manually review each message to classify them and delete those considered to be spam.

OR, I can apply some ✨AI✨ MACHINE LEARNING statistical classification algorithms and basic logic to realize the Pareto Principle automatically. A successful outcome would be a model that can screen each received e-mail for spamminess and successfully decide between “spam” and “not spam.” Of course, since we’re receiving client e-mails too, it would be far worse for our model to mistakenly classify a client e-mail as spam than for the model to classify a spam e-mail as ham. We need to make sure that our errors are biased towards false negatives (Type II) instead of false positives (Type I).
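To make the two kinds of mistakes concrete, here’s a minimal sketch that counts them, treating spam as the positive class. The labels are made up for illustration and `count_errors` is a hypothetical helper, not part of any library.

```python
def count_errors(y_true, y_pred):
    """Count both error types, treating "spam" (True) as the positive class."""
    # False positive (Type I): a real e-mail that got flagged as spam.
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    # False negative (Type II): a spam e-mail that slipped into the inbox.
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return false_positives, false_negatives


# Hypothetical labels: True means spam, False means a real (ham) e-mail.
actual = [False, False, True, True, True, False]
predicted = [False, True, True, False, False, False]

fp, fn = count_errors(actual, predicted)
print(f"client e-mails lost to the spam folder (Type I): {fp}")
print(f"spam that reached the inbox (Type II): {fn}")
```

A model that fits the goal above would rather let `fn` grow a little than let `fp` grow at all.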

getting data (corpus?[3] )

Our raw data is e-mails. I have no interest in manually scraping, so I’ll use Python’s convenience IMAP client libraries to fetch raw data. I have my own wrapper implementations that make this kind of scraping easier, but you can fetch data however you like. For example, I’ve heard that the Enron e-mail corpus is a freely available database of 600k e-mails. Apparently it’s difficult to get access to this kind of private e-mail data unless you’re Google, so this could be interesting to you.

Here’s the IMAP download sequence:
    import email

    import tqdm

    # `mail` is my own wrapper module (not shown); `user` and `password` are the account credentials


    def fetch_folder(folder):
        """Download every message in an IMAP folder as a parsed e-mail message."""
        fetched = []
        with mail.IMAPClient(mail.IMAP_SERVER, mail.IMAP_PORT, user, password) as client:
            client.client.select(folder)

            typ, data = client.client.search(None, 'ALL')
            email_ids = data[0].split()

            print(f"found {len(email_ids)} emails.")

            for num in tqdm.tqdm(email_ids, 'fetching emails!'):
                typ, data = client.client.fetch(num, '(RFC822)')
                for response_part in data:
                    # only the tuple parts carry the raw message bytes
                    if isinstance(response_part, tuple):
                        fetched.append(email.message_from_bytes(response_part[1]))
        return fetched


    # everything in the main inbox is treated as ham, everything in spam_training as spam
    messages = {
        'ham': fetch_folder('inbox'),
        'spam': fetch_folder('spam_training'),
    }

The basic fetching algorithm that I use is to download every e-mail from the main inbox and a folder named spam_training.

Then I extract the message body and the non-body items in the message (i.e. all of the e-mail headers). You might notice in the stats below that the total e-mail counts are less than 1,000. Since I was only able to manually classify a couple hundred e-mails before my fingers fell off, I constrained the non-spam content to a random sample of the e-mails to get close-ish to matching the number of e-mails that I classified. In doing this, I create a more “balanced” dataset that allows statistical machine learning algorithms to focus in on the gap between the non-spam and spam classes. If we used the full dataset, there would be far more real e-mails than spam e-mails and our models would need to learn what spam looks like from a very small share of the data.
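The balancing step can be sketched in a few lines. This is only an illustration under assumptions: `downsample_ham`, the `max_ratio` cap, the seed, and the toy messages are all hypothetical, and the real sampling may have been done differently.

```python
import random


def downsample_ham(ham, spam, max_ratio=1.0, seed=0):
    """Keep every spam message and a random sample of the ham messages.

    max_ratio caps how many ham messages are kept per spam message.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n_ham = min(len(ham), int(len(spam) * max_ratio))
    return rng.sample(ham, n_ham), list(spam)


# Toy stand-ins for the downloaded messages.
ham = [f"real e-mail {i}" for i in range(1000)]
spam = [f"spam e-mail {i}" for i in range(20)]

ham_sample, spam_kept = downsample_ham(ham, spam, max_ratio=3.5)
print(len(ham_sample), len(spam_kept))
```

Sampling without replacement (which is what `random.Random.sample` does) means no real e-mail shows up twice in the training data.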

At this point, I have a pandas DataFrame where every e-mail header is a column and the e-mail body a column with NLTK stopwords excluded. Time to do some explorin'

Here’s the code for cleaning documents into a pandas DataFrame:
    import json

    import pandas as pd

    # `mail`, `pdut`, `spam`, and `clean_text` are my own helpers (not shown)
    rows = []
    for label, is_spam in [('ham', False), ('spam', True)]:
        for m in messages[label]:
            row = dict(m.items())  # one column per e-mail header
            row['body'] = mail.get_body_robust(m)
            row['is_spam'] = is_spam
            rows.append(row)

    # drop headers that are empty for every e-mail
    df = pdut.clean_cols(pd.DataFrame(rows).dropna(how='all', axis=1))
    stop_words = spam.get_stopwords('english')
    df['clean_body'] = df['body'].apply(clean_text, stop_words=stop_words)
    # one text document per e-mail: the cleaned body plus the subject line
    df['doc'] = df[['clean_body', 'subject']].apply(lambda x: json.dumps(x.to_json()), axis=1)

exploratory data analysis

I’m a bit opinionated when it comes to statistics. I think a little subject matter expertise goes a long way in finding predictive features. If you’re not personally an expert, then that’s fine, just go on social media and try to find one to learn from for a bit. You can use that expertise and learning to explore the data without wasting a lot of time.

Fortunately, I have personal experience seeing the spam that Jessie receives. There are some patterns that jump out.

1. subject line

First, I kept seeing e-mails with no subject. Weird.

Who sends e-mails without a subject line?

If I look at the actual distribution of spam v. non-spam e-mails, you can see that 40% (88 / (88 + 134)) of the spam has no subject. If we can find a way to identify empty subject e-mails that are spam, we’ll remove about two fifths of the spam. Conveniently, only 0.65% (5 / (5 + 764)) of the non-spam has no subject[4]. This means we can pretty confidently say “if the e-mail has no subject, it’s spam.” Silly spammers, making this too easy, I don’t even need to whip out the ✨AI✨ yet.

subject    | not spam | is spam
has subj   | 764      | 134
empty subj | 5        | 88

2. from address

Second, I notice that most of the spam comes from new gmail.com addresses. Using the same lens as the empty subject exploration, 95% of the spam comes from an e-mail address ending in @gmail.com. If we can find a way to identify Gmail addresses that are spammers, we’ll remove 95% of the spam. Unfortunately, while most spam is from Gmail, many Gmail e-mails (32%, or 97 out of the total 97 + 210 Gmail e-mails) are decidedly NOT spam. Many of these are e-mails from me to Jessie, so I hope she doesn’t think I’m spam!

sender    | not spam | is spam
not Gmail | 672      | 12
is Gmail  | 97       | 210
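The two percentages quoted for Gmail answer different questions: one is a share of a column (of all spam, how much is Gmail?) and the other a share of a row (of all Gmail, how much is not spam?). A small sketch using the counts from the table above; the `share` helper is hypothetical and just spells out the arithmetic.

```python
# Counts copied from the table above: (row, column) -> number of e-mails.
counts = {
    ("not Gmail", "not spam"): 672, ("not Gmail", "is spam"): 12,
    ("is Gmail", "not spam"): 97, ("is Gmail", "is spam"): 210,
}


def share(row, col, within):
    """Fraction of a column's (or a row's) total that falls in the (row, col) cell."""
    if within == "column":
        total = sum(n for (r, c), n in counts.items() if c == col)
    else:
        total = sum(n for (r, c), n in counts.items() if r == row)
    return counts[(row, col)] / total


spam_from_gmail = share("is Gmail", "is spam", within="column")  # 210 / (210 + 12)
gmail_not_spam = share("is Gmail", "not spam", within="row")     # 97 / (97 + 210)

print(f"{spam_from_gmail:.1%} of the spam comes from Gmail")
print(f"{gmail_not_spam:.1%} of the Gmail e-mails are not spam")
```

The first number says the feature catches most spam; the second says it is too blunt to use as a rule on its own.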

3. bro sign up for my course (free dictionary included)

Third, I vaguely noticed that a lot of these spam e-mails use the same rough vocabulary. Why? Maybe because there’s a video out there called Cold Calling For Beginners: A Step-by-Step Guide to Book Sales Meetings, or because there are more people selling cold calling courses than there are cold calls to make? Don’t even get me started on how there are 10x more courses on how to get started in profitable real estate flipping than there are actual opportunities to profitably flip a house.

Who sends e-mails without a subject line?

My personal angst at telemarketers (and now e-mail marketers) aside, the point is that the vocabulary that they use is drab, rehearsed, copy-pasted, automated, etc. It’s certainly not ✨AI✨ generated using the state of the art LLMs. I suspect this is because if the e-mail were too good then they would attract a higher proportion of qualified leads and these spammers mostly make money on the UN-qualified leads (i.e. how the Nigerian prince scam works. I’ll probably write about this sometime). Ok back to the topic at hand: if you make a wordcloud of what is in the e-mails I classify as spam, it’s immediately clear that there really are a lot of repeats.

want to convert the most sales for store for no commission? just jumble these words

It looks like somebody is telling spammers to jumble together a mix of “conversion”, “product”, “sale”, “store”, “traffic”, “partnership”, “customer”, and “result”. I really hope that the spammers aren’t paying for a course to learn how to do this.
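A wordcloud is essentially a picture of word counts. Here’s a minimal sketch of the counting behind one, using made-up e-mail bodies and a made-up stopword list; `top_words` is a hypothetical helper for illustration only.

```python
import re
from collections import Counter


def top_words(bodies, stop_words=frozenset(), n=5):
    """Return the n most common non-stopword words across all bodies."""
    counts = Counter()
    for body in bodies:
        words = re.findall(r"[a-z']+", body.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts.most_common(n)


# Made-up spam bodies in the style of the real ones.
toy_spam = [
    "boost your store sales and traffic",
    "more traffic, more sales for your store",
    "store partnership to grow sales",
]
toy_stop_words = {"your", "and", "for", "to", "more"}

print(top_words(toy_spam, toy_stop_words, n=3))
```

When a handful of words dominate the counts like this, a vocabulary-based feature has something to latch onto.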

4. “quick question for you, varunhegde.com”

I swear ✨AI✨ is either going to make spamming way easier or way way dumber. If you noticed in the wordcloud above, the spammers just love to say “jessiereneenyc” all over the place. When you read the actual e-mails, it looks something like this:

Hi Jessiereneenyc, Out of curiosity, are most of your sales coming from returning customers, or are you focusing on bringing in new visitors right now?

No normal person writes an e-mail like this I hope. The data confirms this too:

body text        | not spam | is spam
baseline         | 611      | 65
“jessiereneenyc” | 158      | 157

constructing features

Ok, after all this, we have 4-ish feature ideas. My gut ranks them like so:

  1. empty subject –> almost definitely spam
  2. uses the exact text “jessiereneenyc” –> almost definitely spam
  3. generic vocabulary matching probably implies spam
  4. from a Gmail address –> often spam, also encompasses most of the spam

features 1, 2, and 4 are easy features to build, here’s the code:

    feats['no_subject'] = (feats['subject'].str.len() == 0).astype(int)
    feats['has_jessiereneenyc'] = (feats['clean_body'].str.contains('jessiereneenyc') |
                                feats['subject'].str.lower().str.contains('jessiereneenyc')).astype(int)
    # regex=False so the '.' in the domain is matched literally instead of as a wildcard
    feats['is_gmail'] = feats['from'].str.contains('@gmail.com', regex=False).astype(int)

feature 3 is a little harder, and we’ll rely on proper machine learning techniques to build it. Since we know that the feature should basically map to “similarity to spammy e-mail vocabulary”, if we can construct a generic model for spammy vocabulary we’re in a great spot. Fortunately, some very smart people have developed lots of really impressive models to do exactly this.

I’ll use a technique called Term Frequency-Inverse Document Frequency (TF-IDF) Vectorization to transform each e-mail’s content into a vector that represents how frequently each word is used in that e-mail, weighted by how unusual the word is across all e-mails. The “term-frequency” (TF) of a word in a document is just the relative frequency of that term in that document. In layman’s terms, TF for “apple” is computed by counting all the times “apple” is in a document and then dividing by the number of words in the document. Inverse document frequency (IDF) measures how rare a word is across all of the documents in the provided dataset. Again using the “apple” example, divide the total number of documents by the number of documents that “apple” shows up in at least once, then take the log of that number. TF-IDF then is just TF x IDF for a specific word.
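The definition above translates almost word for word into code. This is a textbook-style sketch on made-up, already tokenized documents; the function names are hypothetical.

```python
import math


def term_frequency(term, doc):
    """Share of the words in one document that are `term`."""
    return doc.count(term) / len(doc)


def inverse_document_frequency(term, docs):
    """log(total documents / documents containing `term`).

    The term must appear in at least one document, otherwise this divides by zero.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)


# Made-up documents, each one a list of words.
docs = [
    "apple pie with apple sauce".split(),
    "quick question about your store".split(),
    "your store needs more traffic".split(),
    "custom dress sizing question".split(),
]

print(tf_idf("apple", docs[0], docs))  # rare word, used twice in one document
print(tf_idf("store", docs[1], docs))  # word that appears in half of the documents
```

“apple” scores higher than “store” because it is both more frequent within its document and rarer across the dataset. Note that scikit-learn’s `TfidfVectorizer` uses raw counts for TF, a smoothed IDF, and row normalization by default, so its numbers won’t match this plain version exactly.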

All of this math seems onerous, but again, smarter minds have done the hard work for me. scikit-learn in Python has an implementation called TfidfVectorizer that will generate a matrix of tf-idf terms for a specific set of documents. One line of code!

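    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(max_features=500)  # keep only the 500 most frequent terms
    tfidf_feats = vectorizer.fit_transform(feats['doc'])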

training and testing the model

I now have some heuristic understanding, some labelled data, basic features, some fancier features. I can finally train a model!

WRONG, I have to pick which ones to try first. There are so many to choose from, but if you follow the handy dandy guide that scikit-learn provides you don’t even need to know the pros and cons of the various ones!

wow flow charts amazing

If you follow the flow chart:

What that leaves behind is Naive Bayes classifiers. I’ll use a Multinomial Naive Bayes model because the features are all either binomial (my special features) or multinomial (tf-idf feats), so by definition they all effectively take on counts or frequencies.

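    import pandas as pd
    import scipy.sparse as sp
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    other_feats = ['no_subject', 'has_jessiereneenyc', 'is_gmail']
    if len(other_feats):
        # tack the hand-built features on as extra columns next to the tf-idf matrix
        X = sp.hstack([tfidf_feats, sp.csr_array(feats[other_feats])])
    else:
        X = tfidf_feats

    y = feats['is_spam'].astype(int)

    # split only once X and y exist: 80% to train on, 20% held out for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = MultinomialNB()
    model.fit(X_train, y_train)

    # predict every e-mail, then pull out the held-out rows for the test metrics
    feats['is_spam_pred'] = pd.Series(model.predict(X), index=df.index).astype(bool)
    y_pred = feats.loc[y_test.index, 'is_spam_pred']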

results

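prediction     | Training: Not Spam | Training: Is Spam | Testing: Not Spam | Testing: Is Spam
Not Predicted  | 599                | 18                | 149               | 3
Predicted Spam | 7                  | 168               | 2                 | 45

metric   | Training | Testing
Accuracy | 96.8%    | 97.5%
Recall   | 90.3%    | 93.8%
F1 Score | 93.1%    | 94.7%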
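The summary numbers can be recomputed from the four counts in each half of the table. A minimal sketch using the standard definitions with spam as the positive class; `classification_metrics` is a hypothetical helper. Under these definitions the 90.3% and 93.8% figures are what the sketch calls recall, while precision comes out at 96.0% (training) and 95.7% (testing).

```python
def classification_metrics(tn, fn, fp, tp):
    """Standard metrics from confusion-matrix counts, with spam as the positive class."""
    precision = tp / (tp + fp)  # of everything flagged as spam, how much really was spam
    recall = tp / (tp + fn)     # of all the real spam, how much got flagged
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
    }


# Counts read off the table: rows are predictions, columns are the true labels.
train = classification_metrics(tn=599, fn=18, fp=7, tp=168)
test = classification_metrics(tn=149, fn=3, fp=2, tp=45)

for name, scores in [("training", train), ("testing", test)]:
    print(name, {k: f"{v:.1%}" for k, v in scores.items()})
```

The false positive rate is the number that matters most here, since it counts the real e-mails that would end up hidden in the spam folder.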

I trained the model on 80% of the data and then tested against the remaining 20%. Unlike in the hedge fund world, I don’t need to worry too much about forward looking bias: the spammers are hopefully not changing their methods over time to extract alpha from the spam market.

The results speak for themselves:

Importantly, the false positive rate is pretty low at 1%. Do you think a person that has to sort through 200 e-mails will miss more than 2 e-mails because they have to wade through 50 spam e-mails?

Now, that same person only has to review roughly 150 e-mails and they can be 98% confident that they’re not missing anything important.

running in production?

My rinky dink spam detection model, trained on a random sample of 992 e-mails, does a relatively phenomenal job for a short Friday evening hacknight. It probably took me longer to write this post than it did to do the analysis.

Now all I need to do is scrape together a service that iteratively moves high probability spam e-mails into a “likely-spam” folder and then continues to retrain over time with new data.
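A rough sketch of what that routing step could look like, assuming some function that returns a spam probability for each e-mail (with scikit-learn, that could come from the fitted model’s `predict_proba`). The `route` function, the 0.9 threshold, and the toy scores are all hypothetical.

```python
def route(messages, spam_probability, threshold=0.9):
    """Split messages into (inbox, likely_spam) by spam probability.

    A high threshold keeps borderline e-mails in the inbox, so mistakes
    lean towards missed spam rather than lost client e-mails.
    """
    inbox, likely_spam = [], []
    for msg in messages:
        if spam_probability(msg) >= threshold:
            likely_spam.append(msg)
        else:
            inbox.append(msg)
    return inbox, likely_spam


# Hypothetical scores standing in for a trained model's output.
toy_scores = {
    "custom dress inquiry": 0.02,
    "quick question for you": 0.97,
    "sizing follow-up": 0.55,
}

inbox, likely_spam = route(list(toy_scores), toy_scores.get)
print("inbox:", inbox)
print("likely-spam:", likely_spam)
```

Anything a human later drags out of (or into) the “likely-spam” folder becomes a fresh label for the next round of retraining.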

footnotes

[1] - I’m actually censoring the e-mail here so that they don’t find out I’m making fun of them in case they’re malicious. Ironically, it also prevents them from receiving spam from scraper bots in return.

[2] - Jessie, if you’re reading this, I know you plan to get Gmail set up as soon as it makes sense :)

[3] - this is such a fun word. so domain specific but it does make me feel real fancy to use it. I knew the SAT would come in handy some day

[4] - If I actually go to check those e-mails they actually are spam too, I’m just too lazy to reclassify them.

#Spam #Machine Learning #Diy #Internet #Email #Shopify #Privacy #Statistics #Supervised Models #Classification
