A Quick Business Discussion

Published 2026-01-17 | Last Updated 2026-02-04

Some of you might know that my fiancée is a small business owner on top of the many hats that she wears. Today’s post is about a small part of that experience that I’m familiar with. Read along to learn something about how basic spam e-mail classification models are built and how you too can make a startup that replaces Google, get acquired, then get shut down after securing your bag.

If you’ve ever worked with a small business that doesn’t have Gmail set up to host its e-mail services, you’ll know the struggle of hundreds of e-mails from totally anonymous people fishing for who knows what.

If you haven’t had this experience, imagine that you’re really excited to receive client inquiries for custom dresses and sizing. Unfortunately, 25% of the e-mails you get are complete spam. These solicitors tend to be completely harmless, but boy are they annoying. Here’s an example:

from: SHOPIFY EXPERT ahmedi*************@gmail.com^[1]
subject:
content: If I could help your store achieve 8–10 consistent revenue daily without relying heavily on paid ads, would you be available to a brief walkthrough of the exact system that’s already generating results for other stores?

At first read, 8-10 incremental daily sales without relying on paid ads would be amazing! How much would you normally pay for that? And this guy would do this for us with no guarantees?! Amazing, sign me up. I don’t know what game they’re actually playing because I’ve never responded, but it reeks of SCAM.

Buzz and Woody find Scams Everywhere

Gmail tends to do a pretty decent job of filtering out real spam. Unfortunately, the solicitors tend to also format their e-mails in ways that don’t look entirely like spam and might even be a reasonable outreach attempt (if I didn’t know better). Gmail is also EXPENSIVE. Now you might say, “but Varun, Gmail is so cheap, it’s only $20/month,” but when you’re bootstrapping you make little sacrifices on quality of life issues^[2] because literally every dollar counts and can be used effectively for other things. For example, properly tuned outreach ads can cost $0.01 to $0.03 per view.

As the good little fiancé I am, I decided to use ~~basic logic~~ ~~statistics~~ ~~classification~~ ~~MACHINE LEARNING~~ ✨AI✨ to filter out spam intelligently.

the process

People explore and process data in so many different ways. I always find it helpful to understand how people planned their research before they start. Here’s mine:

Define the problem. Identify possible solutions
Find and archive the relevant data
Explore data to see if any patterns or features stick out
Decide on a model
Train the model
Test the model
Time for production, baby

some disclaimers

you can skip the next section and go right to results if you don’t care too much for the actual machine learning aspects.
If you think that I’m going to explore state of the art ✨AI✨, then I’m sorry to burst your bubble. I’m sure there are people out there that are using ✨AI✨ to solve the spam problem. They’re probably much smarter than me, and also they probably have a budget that eclipses the national GDP of Tuvalu. My budget is exactly $0, not including the cost to purchase and power an overkill home-built PC. I might write about the saga of my computer some day, but suffice to say I bought it in college from a buddy and I’ve made substantial improvements over time. Somehow I’m still not putting it through its paces though.

Okay, now onto the actual work!

defining the problem and solution

Spam is annoying. Every spam e-mail wastes brainspace and bothers us. It distracts from doing real productive things. To deal with the spam, we can either ignore all notifications or manually review each message to classify them and delete those considered to be spam.

OR, I can apply some ~~✨AI✨~~ ~~MACHINE LEARNING~~ statistical classification algorithms and basic logic to realize the Pareto Principle automatically. A successful outcome would be a model that can screen each received e-mail for spamminess and successfully decide between “spam” and “not spam.” Of course, since we’re receiving client e-mails too, it would be far worse for our model to mistakenly classify a client e-mail as spam than for the model to classify a spam e-mail as ham. We need to make sure that our errors are biased towards false negatives (Type II) instead of false positives (Type I).

getting data (corpus?^[3] )

Our raw data is e-mails. I have no interest in manually scraping, so I’ll use Python’s convenience IMAP client libraries to fetch raw data. I have my own wrapper implementations that make this kind of scraping easier, but you can fetch data however you like. For example, I’ve heard that the Enron e-mail corpus is a freely available database of 600k e-mails. Apparently it’s difficult to get access to this kind of private e-mail data unless you’re Google, so this could be interesting to you.

the IMAP download sequence:

    import email

    import tqdm

    # `mail` is my own IMAP wrapper module; `user` and `password` are defined elsewhere

    def fetch_folder(folder):
        """Download every message in an IMAP folder as parsed email messages."""
        fetched = []
        with mail.IMAPClient(mail.IMAP_SERVER, mail.IMAP_PORT, user, password) as client:
            client.client.select(folder)

            typ, data = client.client.search(None, 'ALL')
            email_ids = data[0].split()

            print(f"found {len(email_ids)} emails.")

            for num in tqdm.tqdm(email_ids, 'fetching emails!'):
                typ, data = client.client.fetch(num, '(RFC822)')
                for response_part in data:
                    # the raw message arrives as the second item of a tuple
                    if isinstance(response_part, tuple):
                        fetched.append(email.message_from_bytes(response_part[1]))
        return fetched

    # the main inbox is treated as ham; spam_training holds the hand-labelled spam
    messages = {
        'ham': fetch_folder('inbox'),
        'spam': fetch_folder('spam_training'),
    }

The basic fetching algorithm that I use is to download every e-mail from the main inbox and a folder named spam_training.

Then I extract the message body and the non-body items in the message (i.e. all of the e-mail headers). You might notice in stats below that the total e-mail counts are less than 1,000. Since I was only able to manually classify a couple hundred e-mails before my fingers fell off, I constrained the non-spam content to a random sample of the e-mails to get close-ish to matching the number of e-mails that I classified. In doing this, I create a more “balanced” dataset that allows statistical machine learning algorithms to focus in on the gap between non-spam and spam classes. If we used the full dataset, there would be far more real e-mails than spam e-mails and our models would have to learn what spam looks like from a comparatively tiny slice of the data.
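Here’s a minimal sketch of that downsampling step, assuming the labelled e-mails already sit in a pandas DataFrame with a boolean is_spam column. The helper name downsample_ham, the ratio parameter, and the toy data are all made up for illustration, so treat it as one way to do it rather than the exact code behind this post.

```python
import pandas as pd


def downsample_ham(df, ratio=3.5, seed=42):
    """Keep all spam and a random sample of ham, at most `ratio` ham rows per spam row."""
    spam = df[df["is_spam"]]
    ham = df[~df["is_spam"]]
    n_ham = min(len(ham), int(len(spam) * ratio))
    sampled_ham = ham.sample(n=n_ham, random_state=seed)
    # shuffle so the two classes end up interleaved
    return pd.concat([spam, sampled_ham]).sample(frac=1, random_state=seed)


# toy data: 1,000 ham rows and 20 spam rows
toy = pd.DataFrame({"is_spam": [False] * 1000 + [True] * 20})
balanced = downsample_ham(toy, ratio=3.5)
print(balanced["is_spam"].value_counts())
```

A smaller ratio gives a more balanced dataset at the cost of throwing away more real e-mails, so the right value is a judgment call.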

At this point, I have a pandas DataFrame where every e-mail header is a column and the e-mail body is a column with NLTK stopwords excluded. Time to do some explorin'

the code for cleaning documents into a pandas DataFrame:

    rows = []
    for m in messages['ham']:
        row = dict(m.items())
        row['body'] = mail.get_body_robust(m)
        row['is_spam'] = False
        rows.append(row)

    for m in messages['spam']:
        row = dict(m.items())
        row['body'] = mail.get_body_robust(m)
        row['is_spam'] = True
        rows.append(row)

    df = pdut.clean_cols(pd.DataFrame(rows).dropna(how='all', axis=1))
    stop_words = spam.get_stopwords('english')
    df['clean_body'] = df['body'].apply(clean_text, stop_words=stop_words)
    df['doc'] = df[['clean_body', 'subject']].apply(lambda x: json.dumps(x.to_json()), axis=1)
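The clean_text helper used above isn’t shown in this post. Below is a rough sketch of what such a function could look like, assuming it lowercases the body, keeps only alphabetic tokens, and drops stopwords. The tiny TOY_STOP_WORDS set is a stand-in for the much longer NLTK English list, and the exact cleaning rules are an assumption.

```python
import re

# tiny stand-in stopword list; NLTK's English list is much longer
TOY_STOP_WORDS = frozenset({"a", "an", "and", "are", "is", "of", "the", "to", "you", "your"})


def clean_text(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    if not isinstance(text, str):  # missing bodies show up as NaN/None
        return ""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)


print(clean_text("Are most of your SALES coming from returning customers?"))
```

Because stop_words is a keyword argument, the function plugs into Series.apply the same way as the call above: df['body'].apply(clean_text, stop_words=stop_words).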

exploratory data analysis

I’m a bit opinionated when it comes to statistics. I think a little subject matter expertise goes a long way in finding predictive features. If you’re not personally an expert, then that’s fine, just go on social media and try to find one to learn from for a bit. You can use that expertise and learning to explore the data without wasting a lot of time.

Fortunately, I have personal experience seeing the spam that Jessie receives. There are some patterns that jump out.

1. subject line

First, I kept seeing e-mails with no subject. Weird.

Who sends e-mails without a subject line?

If I look at the actual distribution of spam v. non-spam e-mails, you can see that about 40% (88 / (88 + 134)) of the spam has no subject. If we can find a way to identify empty subject e-mails that are spam, we’ll remove roughly two-fifths of the spam. Conveniently, only 0.65% (5 / (5 + 764)) of the non-spam has no subject^[4]. This means we can pretty confidently say “if the e-mail has no subject, it’s spam.” Silly spammers, making this too easy, I don’t even need to whip out the ✨AI✨ yet.

	not spam	is spam
has subj	764	134
empty subj	5	88
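For anyone who wants to poke at these numbers, here’s a small sketch that rebuilds the table with pandas and normalizes it both ways. The four counts are copied from the table above; the column names and labels are invented for the example.

```python
import pandas as pd

# rebuild one row per e-mail from the four counts in the table above
counts = {
    ("has subj", "not spam"): 764,
    ("has subj", "is spam"): 134,
    ("empty subj", "not spam"): 5,
    ("empty subj", "is spam"): 88,
}
rows = [
    {"subject": subject, "label": label}
    for (subject, label), n in counts.items()
    for _ in range(n)
]
emails = pd.DataFrame(rows)

table = pd.crosstab(emails["subject"], emails["label"])
# share of each class (column) that falls in each subject group
share_by_class = pd.crosstab(emails["subject"], emails["label"], normalize="columns")
# share of each subject group (row) that is spam vs. not spam
share_by_subject = pd.crosstab(emails["subject"], emails["label"], normalize="index")

print(table)
print(share_by_class.round(4))
print(share_by_subject.round(4))
```

The column-normalized view answers “how much of the spam would this rule catch?”, while the row-normalized view answers “if the subject is empty, how likely is it to be spam?”. Both matter when deciding whether a feature is worth keeping.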

2. from address

Second, I notice that most of the spam comes from new gmail.com addresses. Using the same lens as the empty subject exploration, 95% of the spam comes from an email ending in @gmail.com. If we can find a way to identify gmail addresses that are spammers, we’ll remove 95% of the spam. Unfortunately, while most spam is from Gmail, many Gmail e-mails (32%, or 97 out of the total 97 + 210 Gmail) are decidedly NOT spam. Many of these are e-mails from me to Jessie, so I hope she doesn’t think I’m spam!

	not spam	is spam
not Gmail	672	12
is Gmail	97	210

3. bro sign up for my course (free dictionary included)

Third, I vaguely noticed that a lot of these spam e-mails use the same rough vocabulary. Why? Maybe because there’s a video out there called Cold Calling For Beginners: A Step-by-Step Guide to Book Sales Meetings, or because there are more people selling cold calling courses than there are cold calls to make? Don’t even get me started on how there are 10x more courses on how to get started in profitable real estate flipping than there are actual opportunities to profitably flip a house.

My personal angst at telemarketers (and now e-mail marketers) aside, the point is that the vocabulary that they use is drab, rehearsed, copy-pasted, automated, etc. It’s certainly not ✨AI✨ generated using the state of the art LLMs. I suspect this is because if the e-mail were too good then they would attract a higher proportion of qualified leads and these spammers mostly make money on the UN-qualified leads (i.e. how the Nigerian prince scam works. I’ll probably write about this sometime). Ok back to the topic at hand: if you make a wordcloud of what is in the e-mails I classify as spam, it’s immediately clear that there really are a lot of repeats.

want to convert the most sales for store for no commission? just jumble these words

It looks like somebody is telling spammers to jumble together a mix of “conversion”, “product”, “sale”, “store”, “traffic”, “partnership”, “customer”, and “result”. I really hope that the spammers aren’t paying for a course to learn how to do this.

4. “quick question for you, varunhegde.com”

I swear ✨AI✨ is either going to make spamming way easier or way way dumber. If you noticed in the wordcloud above, the spammers just love to say “jessiereneenyc” all over the place. When you read the actual e-mails, it looks something like this:

Hi Jessiereneenyc, Out of curiosity, are most of your sales coming from returning customers, or are you focusing on bringing in new visitors right now?

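No normal person writes an e-mail like this, I hope. The data confirms this too: about 71% of the spam (157 / 222) mentions “jessiereneenyc”, versus about 21% of the non-spam (158 / 769):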

	not spam	is spam
baseline	611	65
“jessiereneenyc”	158	157

constructing features

Ok, after all this, we have 4-ish feature ideas. My gut ranks them like so:

empty subject –> almost definitely spam
uses the exact text “jessiereneenyc” –> almost definitely spam
generic vocabulary matching probably implies spam
from a Gmail address –> often spam, also encompasses most of the spam

features 1, 2, and 4 are easy features to build, here’s the code:

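    # assumes feats is a working copy of df; fillna('') and na=False guard against missing headers
    feats['no_subject'] = (feats['subject'].fillna('').str.len() == 0).astype(int)
    feats['has_jessiereneenyc'] = (feats['clean_body'].str.contains('jessiereneenyc', na=False) |
                                feats['subject'].str.lower().str.contains('jessiereneenyc', na=False)).astype(int)
    # regex=False so the '.' in the domain is matched literally
    feats['is_gmail'] = feats['from'].str.contains('@gmail.com', regex=False, na=False).astype(int)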

feature 3 is a little harder, and we’ll rely on proper machine learning techniques to build it. Since we know that the feature should basically map to “similarity to spammy e-mail vocabulary”, if we can construct a generic model for spammy vocabulary we’re in a great spot. Fortunately, some very smart people have developed lots of really impressive models to do exactly this.

I’ll use a technique called Term Frequency-Inverse Document Frequency (TF-IDF) Vectorization to transform each e-mail’s content into a vector that represents how important each word is to that e-mail. The “term-frequency” (TF) of a word in a document is just the relative frequency of that term in that document. In layman’s terms, TF for “apple” is computed by counting all the times “apple” is in a document and then dividing by the number of words in the document. Inverse document frequency (IDF) measures how rare a word is across all of the documents in the provided dataset. Again using the “apple” example, divide the total number of documents by the number of documents that “apple” shows up in at least once. Take the log of that number. TF-IDF then is just the TF x IDF for a specific word.
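To make the arithmetic concrete, here’s a small pure-Python sketch of the textbook formula exactly as described above (TF = count / document length, IDF = log(total documents / documents containing the word)). The function name tf_idf and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf} dict per document, using the plain textbook formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each word at least once
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


docs = [
    "apple pie with apple sauce",
    "boost store traffic and sales",
    "apple store sales",
]
scores = tf_idf(docs)
print(scores[0]["apple"])  # tf = 2/5, idf = ln(3/2)
print(scores[0]["pie"])    # tf = 1/5, idf = ln(3/1)
```

Even though “apple” shows up twice in the first document, “pie” scores higher because it appears in only one document. Note that scikit-learn’s TfidfVectorizer uses a slightly different recipe by default (raw counts for TF, a smoothed IDF, and L2 normalization of each row), so its numbers won’t match this sketch exactly.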

All of this math seems onerous, but again, smarter minds have done the hard work for me. scikit-learn in Python has an implementation called TfidfVectorizer that will generate a matrix of tf-idf terms for a specific set of documents. Two lines of code!

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(max_features=500)
    tfidf_feats = vectorizer.fit_transform(feats['doc'])

training and testing the model

I now have some heuristic understanding, some labelled data, basic features, some fancier features. I can finally train a model!

WRONG, I have to pick which ones to try first. There are so many to choose from, but if you follow the handy dandy guide that scikit-learn provides you don’t even need to know the pros and cons of the various ones!

wow flow charts amazing

If you follow the flow chart:

I have more than 50 samples (N=~900, maybe I’ll pull in more e-mails if I get bored)
I am predicting a category (spam or ham)
I do have labeled data
I have less than 100k samples (no SGD or kernel approximation for me)
the data includes text-based features

What that leaves behind is Naive Bayes classifiers. I’ll use a Multinomial Naive Bayes model because the features are all binomial (my special features) or multinomial (tf-idf feats): by definition they all effectively take on counts or frequencies.

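    import scipy.sparse as sp
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    # build X and y first so there is something to split
    other_feats = ['no_subject', 'has_jessiereneenyc', 'is_gmail']
    if len(other_feats):
        X = sp.hstack([tfidf_feats, sp.csr_array(feats[other_feats])], format='csr')
    else:
        X = tfidf_feats

    y = feats['is_spam'].astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = MultinomialNB()
    model.fit(X_train, y_train)

    # predict on every row, then pull out the held-out rows by index
    feats['is_spam_pred'] = pd.Series(model.predict(X), index=df.index).astype(bool)
    y_pred = feats.loc[y_test.index, 'is_spam_pred']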
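If you want to see the moving parts without my inbox, here’s a toy version of the same idea: tf-idf features plus one hand-built binary feature, stacked into a single sparse matrix and fed to MultinomialNB. The eight example e-mails, their labels, and the no_subject flags are all invented, so this is a sketch of the technique and not a reproduction of the results.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy corpus, invented for illustration; 1 = spam, 0 = ham
docs = [
    "boost your store traffic and sales with our proven system",
    "grow store conversion and sales with no commission",
    "partnership offer to increase store traffic and customer results",
    "quick question about your store sales and conversion",
    "can I get a custom dress in size eight for a wedding",
    "what is the turnaround time on a custom dress order",
    "following up on my dress fitting appointment next week",
    "do you ship the blue dress to canada",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
# one hand-built binary feature per e-mail, e.g. "subject line is empty"
no_subject = np.array([[1], [1], [0], [1], [0], [0], [0], [0]])

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
X = sp.hstack([tfidf, sp.csr_matrix(no_subject)], format="csr")

model = MultinomialNB()
model.fit(X, labels)

# new e-mails must go through the SAME fitted vectorizer (transform, not fit_transform)
new_docs = ["increase your store sales and traffic", "is the dress available in size eight"]
new_no_subject = np.array([[1], [0]])
X_new = sp.hstack([vectorizer.transform(new_docs), sp.csr_matrix(new_no_subject)], format="csr")
print(model.predict(X_new))
```

The main thing to keep straight is column order: the extra features have to be stacked in the same position at prediction time as they were at training time, otherwise the model reads the wrong columns.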

results

	Training		Testing
	Not Spam	Is Spam	Not Spam	Is Spam
Not Predicted	599	18	149	3
Predicted Spam	7	168	2	45
Accuracy	96.8%		97.5%
Precision	96.0%		95.7%
Recall	90.3%		93.8%
F1 Score	93.1%		94.7%
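As a sanity check, the summary metrics can be recomputed straight from the four cells of each confusion matrix. The sketch below uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), with “spam” as the positive class; the function name summarize is made up, and the counts are copied from the table above.

```python
def summarize(tn, fn, fp, tp):
    """Accuracy, precision, recall, and F1 from the four confusion-matrix cells."""
    total = tn + fn + fp + tp
    precision = tp / (tp + fp)  # of the e-mails flagged as spam, how many really are
    recall = tp / (tp + fn)     # of the true spam, how much gets flagged
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# cell counts copied from the table above
train = summarize(tn=599, fn=18, fp=7, tp=168)
test = summarize(tn=149, fn=3, fp=2, tp=45)
for name, scores in (("train", train), ("test", test)):
    print(name, {k: f"{v:.1%}" for k, v in scores.items()})
```

For this problem, precision is the number to watch most closely: every false positive is a real client e-mail that landed in the spam pile.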

I trained the model on 80% of the data and then tested against the remaining 20%. Unlike in the hedge fund world, I don’t need to worry too much about forward looking bias: the spammers are hopefully not changing their methods over time to extract alpha from the spam market.

The results speak for themselves:

I can correctly guess spam v. ham 97.5% of the time OUT OF SAMPLE
I mistakenly classify 6% of true spam as ham: that’s not so bad
I mistakenly classify 1.3% of true ham as spam.

Importantly, the false positive rate is pretty low at 1%. Do you think a person that has to sort through 200 e-mails will miss more than 2 e-mails because they have to wade through 50 spam e-mails?

Now, that same person only has to review 152 e-mails (the 149 + 3 that the model didn’t flag) and they can be 98% confident that they’re not missing anything important.

running in production?

My rinky dink spam detection model, trained on a random sample of 991 e-mails, does a relatively phenomenal job for a short Friday evening hacknight. Probably took me longer to write this post than it did to do the analysis.

Now all I need to do is scrape together a service that iteratively moves high probability spam e-mails into a “likely-spam” folder and then keeps retraining over time with new data.
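The “high probability” part could be as simple as a threshold on the model’s spam probability. Here’s a hypothetical sketch: the triage function, the 0.9 cutoff, and the example scores are all made up, and the scores are assumed to come from something like the spam column of model.predict_proba.

```python
def triage(spam_probabilities, threshold=0.9):
    """Map each spam probability to a destination folder."""
    return [
        "likely-spam" if p >= threshold else "inbox"
        for p in spam_probabilities
    ]


# hypothetical scores, e.g. the spam column of model.predict_proba(X_new)
scores = [0.99, 0.55, 0.08, 0.93]
print(triage(scores))
```

Raising the threshold above 0.5 is one way to bias errors towards false negatives: borderline messages stay in the inbox, so a client e-mail is less likely to get buried, at the cost of letting a bit more spam through.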

footnotes

[1] - I’m actually censoring the e-mail here so that they don’t find out I’m making fun of them in case they’re malicious. Ironically, it also prevents them from receiving spam from scraper bots in return.

[2] - Jessie, if you’re reading this, I know you plan to get Gmail set up as soon as it makes sense :)

[3] - this is such a fun word. so domain specific but it does make me feel real fancy to use it. I knew the SAT would come in handy some day

[4] - If I actually go to check those e-mails they actually are spam too, I’m just too lazy to reclassify them.

#Spam #Machine Learning #Diy #Internet #Email #Shopify #Privacy #Statistics #Supervised Models #Classification

Over the Hedge
