<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Finance on Over the Hedge</title><link>https://varunhegde.com/blog/finance/</link><description>Recent content in Finance on Over the Hedge</description><generator>Hugo</generator><language>en-US</language><managingEditor>varun@varunhegde.com (Varun Hegde)</managingEditor><webMaster>varun@varunhegde.com (Varun Hegde)</webMaster><copyright>Copyright © 2025, Varun Hegde.</copyright><lastBuildDate>Fri, 19 Jun 2026 17:13:16 -0400</lastBuildDate><atom:link href="https://varunhegde.com/blog/finance/index.xml" rel="self" type="application/rss+xml"/><item><title>You Can't Eat Sharpe</title><link>https://varunhegde.com/you-cant-eat-sharpe/</link><pubDate>Wed, 06 May 2026 20:24:19 -0400</pubDate><author>varun@varunhegde.com (Varun Hegde)</author><guid>https://varunhegde.com/you-cant-eat-sharpe/</guid><description>&lt;p>&lt;img src="https://varunhegde.com/images/sharpe2.png#center" alt="Sharpe GF">&lt;/p>
&lt;p>My brain is probably a little messed up from spending my whole career in trading. I don&amp;rsquo;t mean that it doesn&amp;rsquo;t work, I just mean that I often find little financial hacks that seem like a good idea at the time because the margins justify it. When I actually go ahead and do the trade, of course, it often becomes clear that I missed something crucial: &lt;strong>You Can&amp;rsquo;t Eat Sharpe&lt;/strong>. Before we get into the dumb trade that I did, I want to do a bit of a primer on some basic theory that might be helpful.&lt;/p>
&lt;p>If you already know what a Sharpe ratio is and you dream (nightmare?) of SOFR when you sleep, skip ahead to &lt;a href="#the-fine-dining-experience">Fine Dining&lt;/a> for the meat and potatoes.&lt;/p>
&lt;h2 id="a-primer-on-the-sharpe-ratio">A Primer on the Sharpe Ratio&lt;/h2>
&lt;p>In 1966, William F. Sharpe wrote a paper on measuring &amp;ldquo;Mutual Fund Performance&amp;rdquo;&lt;sup>&lt;a href="#f1">[1]&lt;/a>&lt;/sup>. Plenty of people had tried this before (after all, the first U.S. based mutual fund started in 1924), but this paper has been cited so many times that it&amp;rsquo;s worth visiting even just to understand history.&lt;/p>
&lt;p>The key idea that Sharpe explored is that comparing equity mutual funds of the 1960s was comparing apples to oranges. Technically all mutual funds are fruit, but there&amp;rsquo;s no guarantee that two portfolio managers have the same properties.&lt;/p>
&lt;p>This creates a problem:&lt;/p>
&lt;ol>
&lt;li>Ivan the Investor wants to invest in the &lt;strong>best&lt;/strong> investment. He wants to make the most money possible, but he&amp;rsquo;s also scared of taking too much risk and losing the money he has scrimped to save&lt;/li>
&lt;li>Ivan can invest his whole account in one of two funds, the Apple Always Allocation Fund (AAA) and the Buy Beanie Babies Fund (BBB)&lt;/li>
&lt;li>assume the AAA fund returns roughly 10% per year, with 16% expected annual standard deviation&lt;/li>
&lt;li>assume the BBB fund returns roughly 40% per year with 100% expected annual standard deviation&lt;/li>
&lt;li>assume AAA and BBB are completely uncorrelated&lt;sup>&lt;a href="#f2">[2]&lt;/a>&lt;/sup>&lt;/li>
&lt;/ol>
&lt;h3 id="which-fund-does-ivan-pick">Which fund does Ivan pick?&lt;/h3>
&lt;p>As it stands, the return maximizer with infinite holding horizon is of course going to pick the higher return despite the higher volatility - 40% per year sounds a lot better than only 10% per year. So BBB is the pick.&lt;/p>
&lt;h3 id="which-fund-should-ivan-pick">Which fund SHOULD Ivan pick?&lt;/h3>
&lt;p>In the real world, you can borrow money. In Sharpe&amp;rsquo;s idealized world, you get to borrow at the magical risk free rate. If you live in the United States, you can achieve something of a risk free rate by borrowing at the Secured Overnight Financing Rate (SOFR)&lt;sup>&lt;a href="#f3">[3]&lt;/a>&lt;/sup>. The point is that various repo counterparties borrowed/lent $3.15T overnight on May 5th at roughly 3.62% annualized (i.e. pay about $1/day to borrow $10,000). We&amp;rsquo;ll use this rate for our example, but in reality you&amp;rsquo;ll probably pay some spread to SOFR.&lt;/p>
&lt;p>Now which fund does Ivan pick? Let&amp;rsquo;s run through each of the 3 scenarios below&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;strong>Fund AAA&lt;/strong>&lt;/th>
&lt;th>&lt;strong>Fund BBB&lt;/strong>&lt;/th>
&lt;th>&lt;strong>Fund AAA x6.25&lt;/strong>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>fund % return&lt;/td>
&lt;td>10%&lt;/td>
&lt;td>40%&lt;/td>
&lt;td>10%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>fund % volatility&lt;/td>
&lt;td>16%&lt;/td>
&lt;td>100%&lt;/td>
&lt;td>16%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>starting equity&lt;/td>
&lt;td>$100&lt;/td>
&lt;td>$100&lt;/td>
&lt;td>$100&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>leverage ratio&lt;/td>
&lt;td>1.00&lt;/td>
&lt;td>1.00&lt;/td>
&lt;td>6.25&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>assets&lt;/td>
&lt;td>$100&lt;/td>
&lt;td>$100&lt;/td>
&lt;td>$625&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>debt&lt;/td>
&lt;td>$0&lt;/td>
&lt;td>$0&lt;/td>
&lt;td>$525&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>SOFR&lt;/td>
&lt;td>3.62%&lt;/td>
&lt;td>3.62%&lt;/td>
&lt;td>3.62%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>interest cost&lt;/td>
&lt;td>$0.00&lt;/td>
&lt;td>$0.00&lt;/td>
&lt;td>$19.01&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>gross $return&lt;/td>
&lt;td>$10.00&lt;/td>
&lt;td>$40.00&lt;/td>
&lt;td>$62.50&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>net $return&lt;/td>
&lt;td>$10.00&lt;/td>
&lt;td>$40.00&lt;/td>
&lt;td>$43.50&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>investor % return&lt;/td>
&lt;td>10%&lt;/td>
&lt;td>40%&lt;/td>
&lt;td>43.50%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>investor % volatility&lt;/td>
&lt;td>16%&lt;/td>
&lt;td>100%&lt;/td>
&lt;td>100%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>sharpe ratio&lt;/td>
&lt;td>0.399&lt;/td>
&lt;td>0.364&lt;/td>
&lt;td>0.399&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>If Ivan just buys Fund AAA or Fund BBB outright, he earns exactly the fund&amp;rsquo;s expected return ($10 on a $100 investment in AAA, or $40 on a $100 investment in BBB).&lt;/p>
&lt;p>On the other hand, if he &lt;strong>levers up&lt;/strong> fund AAA by borrowing $525 for every $100 invested, he can eke out extra return just for paying a little bit of interest. By borrowing money, he also increases his risk - the same 16% volatility on $625 in assets is now $100 in volatility (or 100% of the initial investment).&lt;/p>
&lt;p>The best part is that you don&amp;rsquo;t need to walk through this contrived example comparing the S&amp;amp;P 500 to spot Bitcoin (oops sorry I meant fund AAA and BBB) to be able to apply Sharpe&amp;rsquo;s work to other investments. You can simply compute the expected return minus the risk free financing rate and divide by expected risk. That number gives you a clean heuristic for the holy &amp;ldquo;risk adjusted return&amp;rdquo; that everybody on X seems to rave about.&lt;/p>
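&lt;p>As a minimal sketch of that arithmetic (not a pricing tool), the snippet below recomputes the three scenarios from the table. The &lt;code>Position&lt;/code> class and its method names are made up for illustration, and it assumes borrowing at exactly the 3.62% SOFR figure with no spread.&lt;/p>

```python
from dataclasses import dataclass


@dataclass
class Position:
    """Hypothetical helper: one fund held with some leverage."""
    fund_return: float              # expected annual fund return, e.g. 0.10
    fund_vol: float                 # expected annual standard deviation, e.g. 0.16
    leverage: float = 1.0           # assets divided by equity
    financing_rate: float = 0.0362  # SOFR from the example (assumed, no spread)

    def investor_return(self):
        # Return on equity: levered fund return minus interest on borrowed money.
        borrowed = self.leverage - 1.0
        return self.leverage * self.fund_return - borrowed * self.financing_rate

    def investor_vol(self):
        # Volatility scales linearly with leverage.
        return self.leverage * self.fund_vol

    def sharpe(self):
        # Expected return minus the financing rate, divided by expected risk.
        return (self.investor_return() - self.financing_rate) / self.investor_vol()


aaa = Position(0.10, 0.16)
bbb = Position(0.40, 1.00)
aaa_levered = Position(0.10, 0.16, leverage=6.25)

for name, p in [("AAA", aaa), ("BBB", bbb), ("AAA x6.25", aaa_levered)]:
    print(f"{name:10s} return={p.investor_return():.2%} "
          f"vol={p.investor_vol():.0%} sharpe={p.sharpe():.3f}")
```

&lt;p>Under these assumptions the levered and unlevered AAA positions get the same ratio, and both beat BBB. Leverage changes how much return and risk you take, not how well you are paid per unit of risk.&lt;/p>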
&lt;h3 id="a-brief-aside-on-leverage-and-risk-managment">A brief aside on leverage and risk management&lt;/h3>
&lt;blockquote>
&lt;p>This is not investment advice. I happen to work somewhere where we do investing, but none of the views or opinions expressed here represent the views of my employer. These are solely my own opinions. I&amp;rsquo;d also broadly say that getting any kind of information from the internet is a bad idea. This is true if you&amp;rsquo;re going to commit your own capital to it, and even more true if you&amp;rsquo;re going to commit somebody else&amp;rsquo;s capital to it.&lt;/p>
&lt;/blockquote>
&lt;p>You may have noticed in the preceding example that Ivan levered up to 6.25x. For almost all traders, this is a very bad idea because of margin calls and a magnified bankruptcy risk. Don&amp;rsquo;t do this.&lt;/p>
&lt;h2 id="the-fine-dining-experience">The Fine Dining Experience&lt;/h2>
&lt;p>Now that we&amp;rsquo;re all roughly acquainted with how leverage can allow you to pick the better &lt;em>risk-adjusted&lt;/em> investment, we can get into why maximizing risk adjusted returns actually can&amp;rsquo;t be your only criterion.&lt;/p>
&lt;p>I had the bright idea in February &amp;rsquo;24 to participate in a Tax Lien auction. The rough trade idea is very simple. People own property. Local governments in the United States levy property taxes on said property to pay for services that private property owners tend to benefit from (public roads for their cars, public schools for their children, hospitals, human services, etc.). Oftentimes, property owners forget to pay their property taxes. Bidding on a tax lien certificate is effectively bidding on the right to collect the homeowner&amp;rsquo;s property taxes from them plus some penalty and some interest. You are basically fronting them the cash for their property taxes until they can pay it.&lt;/p>
&lt;ul>
&lt;li>90% of the time, the owner has forgotten to pay and they pay the penalty and move on within a week.&lt;/li>
&lt;li>8% of the time, they choose not to pay. You have to send them a letter and they pay the penalty and some interest within a month.&lt;/li>
&lt;li>1.5% of the time, they can&amp;rsquo;t afford to pay because of circumstances in their control. For example, they bought a $2M house and then bought a Lambo the next month. This usually takes a while to resolve.&lt;/li>
&lt;li>0.5% of the time, there are unavoidable circumstances. Who am I to judge?&lt;sup>&lt;a href="#f4">[4]&lt;/a>&lt;/sup>&lt;/li>
&lt;/ul>
&lt;p>There&amp;rsquo;s relatively low (zero?) risk of this overcollateralized financial instrument (loan?) resulting in a loss, and the returns are generally decent (benchmark 6-7% annual coupon rate). Also the duration risk is relatively low given that most trades complete within a month or two and even the extended ones last max 2y.&lt;/p>
&lt;p>This is exactly the kind of high Sharpe trade that should be massively levered, right?&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/sharpe.png#center" alt="Padme Leverage">&lt;/p>
&lt;p>This is where it turns out that Sharpe isn&amp;rsquo;t the only metric that matters at the end of the day.&lt;/p>
&lt;h3 id="a-real-example">A Real Example&lt;/h3>
&lt;p>Below are the cashflows associated with the real auction that I did back in 2024. The IRR of almost 15% on fully collateralized notes with super-senior lien status is just absolutely mouthwatering. It&amp;rsquo;s also such a weirdly niche corner of finance that you don&amp;rsquo;t have a lot of institutional competitors crowding you out.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Date&lt;/th>
&lt;th>Cash Flow&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>3/26/2024&lt;/td>
&lt;td>$ (1,480)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3/26/2024&lt;/td>
&lt;td>$ (1,900)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3/26/2024&lt;/td>
&lt;td>$ (1,913)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3/26/2024&lt;/td>
&lt;td>$ (1,190)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3/26/2024&lt;/td>
&lt;td>$ (1,679)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>5/1/2025&lt;/td>
&lt;td>$ 2,373&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>10/17/2024&lt;/td>
&lt;td>$ 1,628&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4/11/2024&lt;/td>
&lt;td>$ 2,104&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3/31/2026&lt;/td>
&lt;td>$ 1,368&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3/24/2026&lt;/td>
&lt;td>$ 1,930&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>IRR&lt;/td>
&lt;td>&lt;strong>14.5%&lt;/strong>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
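&lt;p>The IRR in the table can be sanity-checked with a few lines of Python. This is a rough sketch that assumes an XIRR-style calculation (annual compounding, actual days divided by 365) solved by bisection; the post doesn&amp;rsquo;t say which convention was actually used, and the function names here are made up for illustration.&lt;/p>

```python
import datetime as dt

# Dated cash flows copied from the table above (negative = cash paid out).
CASHFLOWS = [
    (dt.date(2024, 3, 26), -1480),
    (dt.date(2024, 3, 26), -1900),
    (dt.date(2024, 3, 26), -1913),
    (dt.date(2024, 3, 26), -1190),
    (dt.date(2024, 3, 26), -1679),
    (dt.date(2025, 5, 1), 2373),
    (dt.date(2024, 10, 17), 1628),
    (dt.date(2024, 4, 11), 2104),
    (dt.date(2026, 3, 31), 1368),
    (dt.date(2026, 3, 24), 1930),
]


def npv(rate, cashflows):
    """Net present value, discounting each flow back to the earliest date."""
    t0 = min(d for d, _ in cashflows)
    return sum(cf / (1.0 + rate) ** ((d - t0).days / 365.0) for d, cf in cashflows)


def xirr(cashflows, lo=0.0, hi=1.0, iterations=100):
    """Bisection on the annual rate; assumes NPV is positive at lo and negative at hi."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if npv(mid, cashflows) > 0:
            lo = mid   # rate too low, present value of inflows still too high
        else:
            hi = mid
    return (lo + hi) / 2.0


rate = xirr(CASHFLOWS)
print(f"net cash: ${sum(cf for _, cf in CASHFLOWS):,}")
print(f"annualized IRR: {rate:.1%}")
```

&lt;p>With that day-count assumption the result lands at roughly 14.5%, in line with the figure in the table.&lt;/p>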
&lt;p>The main risk here is that the taxpayer decides to never pay you back, you need to foreclose, and the lawyer fees cost more than the house being foreclosed on is worth. As it turns out, all of these properties were worth more than 100x the lien, so that&amp;rsquo;s an unlikely issue.&lt;/p>
&lt;p>On paper this is a sweet deal. It&amp;rsquo;s like picking up Sacagaweas off the floor in front of a steamroller except the steamroller is out of gas. I had fewer white hairs in my beard back then.&lt;/p>
&lt;h2 id="whats-the-problem">What&amp;rsquo;s the Problem?&lt;/h2>
&lt;p>You probably noticed that I only made $1241 after committing $8162 in capital for almost 2 years. Considering I financed 100% of this at SOFR, I basically clipped an IRR of 11.5% &lt;em>FOR FREE&lt;/em>. Why not do this &lt;em>ad infinitum&lt;/em> to make infinite money?&lt;/p>
&lt;p>Turns out that high Sharpe just means that you have a capital efficient trade. The big problem with this investment is that the total market issuance is only about $25B a year across the continental United States. That&amp;rsquo;s every single municipality in the country. My $8k was only 0.5% of the bids in the auction I participated in. I won&amp;rsquo;t get into my pricing algorithm for placing bids, but it does mean that if I were looking to make meaningful money (call it $1M), I would need to commit to bidding in 100+ municipalities with similar scale, or price tighter so I capture more of each auction (and reduce my implied yield to maturity).&lt;/p>
&lt;p>It&amp;rsquo;s really really painful to navigate around even one township&amp;rsquo;s rules and regulations surrounding tax lien certificates. Don&amp;rsquo;t even get me started on expanding to hundreds.&lt;/p>
&lt;h2 id="parting-thoughts">Parting Thoughts&lt;/h2>
&lt;p>This could be an interesting trade for somebody that is willing to manage a whole portfolio of certs. If anybody has experience with similar-ish instruments feel free to reach out, I&amp;rsquo;m always interested to learn more and hear war stories.&lt;/p>
&lt;h2 id="footnotes">Footnotes&lt;/h2>
&lt;p>&lt;span id="f1">[1]&lt;/span> - The Journal of Business, Vol. 39, No. 1, Part 2: Supplement on Security Prices, pp. 119-138 &lt;a href="https://www.jstor.org/stable/2351741">JSTOR 2351741&lt;/a>&lt;/p>
&lt;p>&lt;span id="f2">[2]&lt;/span> - ask &lt;a href="https://www.tylervigen.com/spurious-correlations">Tyler Vigen&lt;/a> if there&amp;rsquo;s a spurious correlation between Apple, Inc. and Beanie Babies&lt;/p>
&lt;p>&lt;span id="f3">[3]&lt;/span> - I won&amp;rsquo;t go into the mechanics of how you can put on this trade, it&amp;rsquo;s not important. Maybe for a future piece&lt;/p>
&lt;p>&lt;span id="f4">[4]&lt;/span> - these are the folks that I actually feel for. It can be a very harrowing experience and often times paying your property taxes is the last thing that gets dropped&lt;/p></description></item><item><title>I Was the Spam All Along</title><link>https://varunhegde.com/i-was-the-spam-all-along/</link><pubDate>Sun, 19 Apr 2026 20:58:26 -0400</pubDate><author>varun@varunhegde.com (Varun Hegde)</author><guid>https://varunhegde.com/i-was-the-spam-all-along/</guid><description>&lt;p>This one is just going to be a short follow up on &lt;a href="https://varunhegde.com/a-quick-business-discussion">A Quick Business Discussion&lt;/a>. You might be wondering if I ended up running the model in production. I did.&lt;/p>
&lt;p>Initially, I started running the filter every 15 minutes, for any message sent in the last 7 days. If you&amp;rsquo;re familiar with low-bandwidth, low-cost IMAP servers, you might already see where this is about to go wrong.&lt;/p>
&lt;p>We got an e-mail that was immediately classified as spam&amp;hellip;&lt;/p>
&lt;blockquote>
&lt;p>Immediate Action Required: Suspicious Download Activity Detected on Your Titan Account&lt;/p>
&lt;p>I am writing to address a matter concerning an unusual surge in downloads amounting to 58.3 GB in last 7 days originating from ****@j****e.com Titan account setup on a third-party email client. Unfortunately, this activity is categorized as suspicious behavior and signifies an unfair utilization of our account resources.&lt;/p>
&lt;/blockquote>
&lt;p>I followed up and gave the tech team a little more information. It wasn&amp;rsquo;t a scam or a botnet, it was just my naive filter re-downloading what must have been a very large file sent in the last week. They replied with the following:&lt;/p>
&lt;blockquote>
&lt;p>Yes, continuously scanning emails from the last week every 15 minutes would result in a high volume of downloads, which aligns with the activity we observed.&lt;/p>
&lt;p>Limiting the scope (e.g., only unread emails from the last 24 hours) would be a more reasonable approach.&lt;/p>
&lt;/blockquote>
&lt;p>First off, the Titan opsec teams are so nice. I started my career in a place where the default response to something this dumb would not be so nice. So kudos to Titan.&lt;/p>
&lt;p>Second off, the fix is so darn simple that all you had to do was check the log file (that was 100s of MB long when printing every e-mail subject) to understand what was going on.&lt;/p>
&lt;p>Actually, this reminds me of a similar story involving Tibco Rendezvous multicasts.&lt;/p>
&lt;h2 id="a-brief-detour-on-dos-ing-systems">A brief detour on DOS-ing systems&lt;/h2>
&lt;p>I might be butchering the story, but I heard it almost 10 years ago so cut me some slack.&lt;/p>
&lt;p>A friend needed to ack messages received over a Tibco Rendezvous (TibRV) multicast because something wasn&amp;rsquo;t working. The ack was very simple: if you receive a message, send &amp;ldquo;yo yo yo, this is ______&amp;rdquo;. The modern day equivalent of this is testing a full-duplex websocket by sending keep-alive messages. Should work. The problem is that said acking code was also listening to the same channel, so every received ack created another one in an infinite loop of UDP packets. UDP on the corporate intranet sure is speedy. So this causes a massive bottleneck on the multicast channels and inevitably brings down multiple hosts. When the production support engineers go to check the logs to figure out what happened, they find &amp;ldquo;yo yo yo, this is ______&amp;rdquo;.&lt;sup>&lt;a href="#f1">[1]&lt;/a>&lt;/sup>&lt;/p>
&lt;h2 id="back-to-e-mail">Back to E-mail&lt;/h2>
&lt;p>As I said, the fix was so darn simple. The following IMAP query:&lt;/p>
&lt;blockquote>
&lt;p>since = dt.date.today()&lt;/p>
&lt;p>&amp;ldquo;SEARCH&amp;rdquo;, f&amp;rsquo;(SENTSINCE {since.strftime(&amp;quot;%d-%b-%Y&amp;quot;)})'&lt;/p>
&lt;/blockquote>
&lt;p>Needed to become a slightly more restrictive one:&lt;/p>
&lt;blockquote>
&lt;p>since = dt.date.today() - rdate(days=7)&lt;/p>
&lt;p>&amp;ldquo;SEARCH&amp;rdquo;, f&amp;rsquo;(UNSEEN SENTSINCE {since.strftime(&amp;quot;%d-%b-%Y&amp;quot;)})'&lt;/p>
&lt;/blockquote>
&lt;p>And voila, running every 15 minutes has stopped causing an issue! Until the next time that maybe we get an insanely large e-mail&lt;/p>
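&lt;p>For anyone who wants to try the restricted search, here is a minimal sketch using only the standard library. The helper names are hypothetical, and it uses &lt;code>datetime.timedelta&lt;/code> rather than the &lt;code>rdate&lt;/code> helper from the snippet above. It assumes an English locale, since IMAP expects English month abbreviations in the date.&lt;/p>

```python
import datetime as dt
import imaplib


def build_search_criteria(today, days_back=7, unseen_only=True):
    """Build an IMAP SEARCH string such as '(UNSEEN SENTSINCE 12-Apr-2026)'."""
    since = today - dt.timedelta(days=days_back)
    flags = "UNSEEN " if unseen_only else ""
    # %b is locale dependent; IMAP wants English month names (Jan, Feb, ...).
    return f"({flags}SENTSINCE {since.strftime('%d-%b-%Y')})"


def find_candidates(conn, today=None):
    """Return message ids matching the search.

    conn is assumed to be an already logged-in imaplib.IMAP4 or
    imaplib.IMAP4_SSL connection; server details are left to the caller.
    """
    conn.select("INBOX")
    typ, data = conn.search(None, build_search_criteria(today or dt.date.today()))
    return data[0].split() if typ == "OK" else []


print(build_search_criteria(dt.date(2026, 4, 19)))
```

&lt;p>The UNSEEN filter only helps if messages actually get flagged as seen after a pass, either by the fetch itself or by a human reading them. Otherwise every run downloads the same messages again.&lt;/p>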
&lt;h2 id="footnotes">footnotes&lt;/h2>
&lt;p>&lt;span id="f1">[1]&lt;/span> - I promise this really wasn&amp;rsquo;t me, but also it&amp;rsquo;s a mentor that I respect a lot. Real engineering starts and ends with print statements so often and it&amp;rsquo;s fun to know that your mentors are human too&lt;/p></description></item><item><title>A Quick Business Discussion</title><link>https://varunhegde.com/a-quick-business-discussion/</link><pubDate>Sat, 17 Jan 2026 15:29:37 -0500</pubDate><author>varun@varunhegde.com (Varun Hegde)</author><guid>https://varunhegde.com/a-quick-business-discussion/</guid><description>&lt;p>Some of you might know that my fiancée is a small business owner on top of the many hats that she wears. Today&amp;rsquo;s post is about a small part of that experience that I&amp;rsquo;m familiar with. Read along to learn something about how basic spam e-mail classification models are built and how you too can make a startup that replaces Google, get acquired, then get shut down after securing your bag.&lt;/p>
&lt;p>If you&amp;rsquo;ve ever worked with a small business that doesn&amp;rsquo;t have Gmail set up to host its e-mail services, you&amp;rsquo;ll know the struggle of hundreds of e-mails from totally anonymous people fishing for who knows what.&lt;/p>
&lt;p>If you haven&amp;rsquo;t had this experience, imagine that you&amp;rsquo;re really excited to receive client inquiries for custom dresses and sizing. Unfortunately, 25% of the e-mails you get are complete spam. These solicitors tend to be completely harmless, but boy are they annoying. Here&amp;rsquo;s an example:&lt;/p>
&lt;blockquote>
&lt;p>from: SHOPIFY EXPERT &lt;a href="mailto:ahmedi*************@gmail.com">ahmedi*************@gmail.com&lt;/a>&lt;sup>&lt;a href="#f1">[1]&lt;/a>&lt;/sup>&lt;/p>
&lt;p>subject:&lt;/p>
&lt;p>content: If I could help your store achieve 8–10 consistent revenue daily without relying heavily on paid ads, would you be available to a brief walkthrough of the exact system that’s already generating results for other stores?&lt;/p>
&lt;/blockquote>
&lt;p>At first read, 8-10 incremental daily sales without relying on paid ads would be amazing! How much would you normally pay for that? And this guy would do this for us with no guarantees?! Amazing, sign me up. I don&amp;rsquo;t know what game they&amp;rsquo;re actually playing because I&amp;rsquo;ve never responded, but it reeks of &lt;strong>SCAM&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/spam/scams_meme1.webp#center" alt="Buzz and Woody find Scams Everywhere">&lt;/p>
&lt;p>Gmail tends to do a pretty decent job of filtering out &lt;em>real&lt;/em> spam. Unfortunately, the solicitors tend to also format their e-mails in ways that don&amp;rsquo;t look entirely like spam and might even be a reasonable outreach attempt (if I didn&amp;rsquo;t know better). Gmail is also &lt;strong>EXPENSIVE&lt;/strong>. Now you might say, &amp;ldquo;but Varun, Gmail is so cheap, it&amp;rsquo;s only $20/month,&amp;rdquo; but when you&amp;rsquo;re bootstrapping you make little sacrifices on quality of life issues&lt;sup>&lt;a href="#f2">[2]&lt;/a>&lt;/sup> because literally every dollar counts and can be used effectively for other things. For example, properly tuned outreach ads can cost $0.01 to $0.03 per view.&lt;/p>
&lt;p>As the good little fiancé I am, I decided to use &lt;del>basic logic&lt;/del> &lt;del>statistics&lt;/del> &lt;del>classification&lt;/del> &lt;del>MACHINE LEARNING&lt;/del> ✨AI✨ to filter out spam intelligently.&lt;/p>
&lt;h1 id="contents">contents&lt;/h1>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#contents">contents&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-process">the process&lt;/a>&lt;/li>
&lt;li>&lt;a href="#some-disclaimers">some disclaimers&lt;/a>&lt;/li>
&lt;li>&lt;a href="#defining-the-problem-and-solution">defining the problem and solution&lt;/a>&lt;/li>
&lt;li>&lt;a href="#getting-data-corpussup3f3sup-">getting data (corpus?&lt;sup>&lt;a href="#f3">[3]&lt;/a>&lt;/sup> )&lt;/a>&lt;/li>
&lt;li>&lt;a href="#exploratory-data-analysis">exploratory data analysis&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#1-subject-line">1. subject line&lt;/a>&lt;/li>
&lt;li>&lt;a href="#2-from-address">2. from address&lt;/a>&lt;/li>
&lt;li>&lt;a href="#3-bro-sign-up-for-my-course-free-dictionary-included">3. bro sign up for my course (free dictionary included)&lt;/a>&lt;/li>
&lt;li>&lt;a href="#4-quick-question-for-you-varunhegdecom">4. &amp;ldquo;quick question for you, varunhegde.com&amp;rdquo;&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#constructing-features">constructing features&lt;/a>&lt;/li>
&lt;li>&lt;a href="#training-and-testing-the-model">training and testing the model&lt;/a>&lt;/li>
&lt;li>&lt;a href="#results">results&lt;/a>&lt;/li>
&lt;li>&lt;a href="#running-in-production">running in production?&lt;/a>&lt;/li>
&lt;li>&lt;a href="#footnotes">footnotes&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;h1 id="the-process">the process&lt;/h1>
&lt;p>People explore and process data in so many different ways. I always find it helpful to understand how people planned their research before they start. Here&amp;rsquo;s mine:&lt;/p>
&lt;ol>
&lt;li>Define the problem. Identify possible solutions&lt;/li>
&lt;li>Find and archive the relevant data&lt;/li>
&lt;li>Explore data to see if any patterns or features stick out&lt;/li>
&lt;li>Decide on a model&lt;/li>
&lt;li>Train the model&lt;/li>
&lt;li>Test the model&lt;/li>
&lt;li>Time for production, baby&lt;/li>
&lt;/ol>
&lt;h1 id="some-disclaimers">some disclaimers&lt;/h1>
&lt;ul>
&lt;li>you can skip the next section and go right to &lt;a href="#results">results&lt;/a> if you don&amp;rsquo;t care too much for the actual machine learning aspects.&lt;/li>
&lt;li>If you think that I&amp;rsquo;m going to explore state of the art ✨AI✨, then I&amp;rsquo;m sorry to burst your bubble. I&amp;rsquo;m sure there are people out there that are using ✨AI✨ to solve the spam problem. They&amp;rsquo;re probably much smarter than me, and also they probably have a budget that eclipses the national GDP of &lt;a href="https://en.wikipedia.org/wiki/Tuvalu">Tuvalu&lt;/a>. My budget is exactly $0, not including the cost to purchase and power an overkill home-built PC. I might write about the saga of my computer some day, but suffice to say I bought it in college from a buddy and I&amp;rsquo;ve made substantial improvements over time. Somehow I&amp;rsquo;m still not putting it through its paces though.&lt;/li>
&lt;/ul>
&lt;p>Okay, now onto the actual work!&lt;/p>
&lt;h1 id="defining-the-problem-and-solution">defining the problem and solution&lt;/h1>
&lt;p>Spam is annoying. Every spam e-mail wastes brainspace and bothers us. It distracts from doing real productive things. To deal with the spam, we can either ignore all notifications or manually review each message to classify them and delete those considered to be spam.&lt;/p>
&lt;p>OR, I can apply some &lt;del>✨AI✨&lt;/del> &lt;del>MACHINE LEARNING&lt;/del> statistical classification algorithms and basic logic to realize the Pareto Principle automatically. A successful outcome would be a model that can screen each received e-mail for spamminess and successfully decide between &amp;ldquo;spam&amp;rdquo; and &amp;ldquo;not spam.&amp;rdquo; Of course, since we&amp;rsquo;re receiving client e-mails too, it would be far worse for our model to mistakenly classify a client e-mail as spam than for the model to classify a spam e-mail as ham. Treating &amp;ldquo;spam&amp;rdquo; as the positive class, we need to make sure that our errors are biased towards false negatives (Type II) instead of false positives (Type I).&lt;/p>
&lt;h1 id="getting-data-corpussup3f3sup-">getting data (corpus?&lt;sup>&lt;a href="#f3">[3]&lt;/a>&lt;/sup> )&lt;/h1>
&lt;p>Our raw data is e-mails. I have no interest in manually scraping, so I&amp;rsquo;ll use Python&amp;rsquo;s convenience IMAP client libraries to fetch raw data. I have my own wrapper implementations that make this kind of scraping easier, but you can fetch data however you like. For example, I&amp;rsquo;ve heard that the &lt;a href="https://en.wikipedia.org/wiki/Enron_Corpus">Enron e-mail corpus&lt;/a> is a freely available database of 600k e-mails. Apparently it&amp;rsquo;s difficult to get access to this kind of private e-mail data unless you&amp;rsquo;re Google, so this could be interesting to you.&lt;/p>
&lt;details>
&lt;summary>&lt;a>click to expand and display the IMAP download sequence&lt;/a>&lt;/summary>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span> messages &lt;span style="color:#f92672">=&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;spam&amp;#39;&lt;/span>: [],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;ham&amp;#39;&lt;/span>: []
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">with&lt;/span> mail&lt;span style="color:#f92672">.&lt;/span>IMAPClient(mail&lt;span style="color:#f92672">.&lt;/span>IMAP_SERVER, mail&lt;span style="color:#f92672">.&lt;/span>IMAP_PORT, user, password) &lt;span style="color:#66d9ef">as&lt;/span> client:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> client&lt;span style="color:#f92672">.&lt;/span>client&lt;span style="color:#f92672">.&lt;/span>select(&lt;span style="color:#e6db74">&amp;#39;inbox&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> typ, data &lt;span style="color:#f92672">=&lt;/span> client&lt;span style="color:#f92672">.&lt;/span>client&lt;span style="color:#f92672">.&lt;/span>search(&lt;span style="color:#66d9ef">None&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;ALL&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> email_ids &lt;span style="color:#f92672">=&lt;/span> data[&lt;span style="color:#ae81ff">0&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>split()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;found &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>len(email_ids)&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> emails.&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> num &lt;span style="color:#f92672">in&lt;/span> tqdm&lt;span style="color:#f92672">.&lt;/span>tqdm(email_ids, &lt;span style="color:#e6db74">&amp;#39;fetching emails!&amp;#39;&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> typ, data &lt;span style="color:#f92672">=&lt;/span> client&lt;span style="color:#f92672">.&lt;/span>client&lt;span style="color:#f92672">.&lt;/span>fetch(num, &lt;span style="color:#e6db74">&amp;#39;(RFC822)&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> response_part &lt;span style="color:#f92672">in&lt;/span> data:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> isinstance(response_part, tuple):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> msg &lt;span style="color:#f92672">=&lt;/span> email&lt;span style="color:#f92672">.&lt;/span>message_from_bytes(response_part[&lt;span style="color:#ae81ff">1&lt;/span>])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> messages[&lt;span style="color:#e6db74">&amp;#39;ham&amp;#39;&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>append(msg)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">with&lt;/span> mail&lt;span style="color:#f92672">.&lt;/span>IMAPClient(mail&lt;span style="color:#f92672">.&lt;/span>IMAP_SERVER, mail&lt;span style="color:#f92672">.&lt;/span>IMAP_PORT, user, password) &lt;span style="color:#66d9ef">as&lt;/span> client:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> client&lt;span style="color:#f92672">.&lt;/span>client&lt;span style="color:#f92672">.&lt;/span>select(&lt;span style="color:#e6db74">&amp;#39;spam_training&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> typ, data &lt;span style="color:#f92672">=&lt;/span> client&lt;span style="color:#f92672">.&lt;/span>client&lt;span style="color:#f92672">.&lt;/span>search(&lt;span style="color:#66d9ef">None&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;ALL&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> email_ids &lt;span style="color:#f92672">=&lt;/span> data[&lt;span style="color:#ae81ff">0&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>split()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;found &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>len(email_ids)&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> emails.&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> num &lt;span style="color:#f92672">in&lt;/span> tqdm&lt;span style="color:#f92672">.&lt;/span>tqdm(email_ids, &lt;span style="color:#e6db74">&amp;#39;fetching emails!&amp;#39;&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> typ, data &lt;span style="color:#f92672">=&lt;/span> client&lt;span style="color:#f92672">.&lt;/span>client&lt;span style="color:#f92672">.&lt;/span>fetch(num, &lt;span style="color:#e6db74">&amp;#39;(RFC822)&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> response_part &lt;span style="color:#f92672">in&lt;/span> data:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> isinstance(response_part, tuple):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> msg &lt;span style="color:#f92672">=&lt;/span> email&lt;span style="color:#f92672">.&lt;/span>message_from_bytes(response_part[&lt;span style="color:#ae81ff">1&lt;/span>])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> messages[&lt;span style="color:#e6db74">&amp;#39;spam&amp;#39;&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>append(msg)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/details>
&lt;p>The basic fetching algorithm that I use is to download every e-mail from the main inbox and a folder named &lt;code>spam_training&lt;/code>.&lt;/p>
&lt;p>Then I extract the message body and the non-body items in the message (i.e. all of the e-mail headers). You might notice in the stats below that the total e-mail counts are less than 1,000. Since I was only able to manually classify a couple hundred e-mails before my fingers fell off, I constrained the non-spam content to a random sample of the e-mails to get close-ish to matching the number of e-mails that I classified. In doing this, I create a more &amp;ldquo;balanced&amp;rdquo; dataset that allows statistical machine learning algorithms to focus in on the gap between non-spam and spam classes. If we used the full dataset, there would be far more &lt;strong>real&lt;/strong> e-mails than spam e-mails and our models would see comparatively few spam examples to learn from.&lt;/p>
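&lt;p>A minimal sketch of that balancing step, assuming the ham and spam messages already sit in two Python lists. The helper name &lt;code>downsample_ham&lt;/code>, the &lt;code>ratio&lt;/code> knob, the seed, and the toy data are all made up for illustration; this is not the code used for the post.&lt;/p>

```python
import random


def downsample_ham(ham, spam, ratio=1.0, seed=0):
    """Keep a random sample of ham, at most `ratio` times the number of spam messages."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n_ham = min(len(ham), int(ratio * len(spam)))
    return rng.sample(ham, n_ham)


# Toy stand-ins for real messages: 1,000 ham and 200 hand-labelled spam.
ham = [f"ham-{i}" for i in range(1000)]
spam = [f"spam-{i}" for i in range(200)]

balanced_ham = downsample_ham(ham, spam)           # exactly balanced
loose_ham = downsample_ham(ham, spam, ratio=3.5)   # only roughly balanced
print(len(balanced_ham), len(loose_ham), len(spam))
```

&lt;p>A ratio above 1 keeps more ham than spam, which is closer to the &amp;ldquo;close-ish&amp;rdquo; balance described above than a strict one-to-one match.&lt;/p>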
&lt;p>At this point, I have a pandas DataFrame where every e-mail header is a column and the e-mail body a column with NLTK stopwords excluded. Time to do some explorin'&lt;/p>
&lt;details>
&lt;summary>&lt;a>click for the code for cleaning documents into a pandas DataFrame&lt;/a>&lt;/summary>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span> rows &lt;span style="color:#f92672">=&lt;/span> []
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> m &lt;span style="color:#f92672">in&lt;/span> messages[&lt;span style="color:#e6db74">&amp;#39;ham&amp;#39;&lt;/span>]:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> row &lt;span style="color:#f92672">=&lt;/span> dict(m&lt;span style="color:#f92672">.&lt;/span>items())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> row[&lt;span style="color:#e6db74">&amp;#39;body&amp;#39;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> mail&lt;span style="color:#f92672">.&lt;/span>get_body_robust(m)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> row[&lt;span style="color:#e6db74">&amp;#39;is_spam&amp;#39;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">False&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> rows&lt;span style="color:#f92672">.&lt;/span>append(row)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> m &lt;span style="color:#f92672">in&lt;/span> messages[&lt;span style="color:#e6db74">&amp;#39;spam&amp;#39;&lt;/span>]:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> row &lt;span style="color:#f92672">=&lt;/span> dict(m&lt;span style="color:#f92672">.&lt;/span>items())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> row[&lt;span style="color:#e6db74">&amp;#39;body&amp;#39;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> mail&lt;span style="color:#f92672">.&lt;/span>get_body_robust(m)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> row[&lt;span style="color:#e6db74">&amp;#39;is_spam&amp;#39;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">True&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> rows&lt;span style="color:#f92672">.&lt;/span>append(row)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> df &lt;span style="color:#f92672">=&lt;/span> pdut&lt;span style="color:#f92672">.&lt;/span>clean_cols(pd&lt;span style="color:#f92672">.&lt;/span>DataFrame(rows)&lt;span style="color:#f92672">.&lt;/span>dropna(how&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;all&amp;#39;&lt;/span>, axis&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> stop_words &lt;span style="color:#f92672">=&lt;/span> spam&lt;span style="color:#f92672">.&lt;/span>get_stopwords(&lt;span style="color:#e6db74">&amp;#39;english&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> df[&lt;span style="color:#e6db74">&amp;#39;clean_body&amp;#39;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> df[&lt;span style="color:#e6db74">&amp;#39;body&amp;#39;&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>apply(clean_text, stop_words&lt;span style="color:#f92672">=&lt;/span>stop_words)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> df[&lt;span style="color:#e6db74">&amp;#39;doc&amp;#39;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> df[[&lt;span style="color:#e6db74">&amp;#39;clean_body&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;subject&amp;#39;&lt;/span>]]&lt;span style="color:#f92672">.&lt;/span>apply(&lt;span style="color:#66d9ef">lambda&lt;/span> x: json&lt;span style="color:#f92672">.&lt;/span>dumps(x&lt;span style="color:#f92672">.&lt;/span>to_json()), axis&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/details>
&lt;h1 id="exploratory-data-analysis">exploratory data analysis&lt;/h1>
&lt;p>I&amp;rsquo;m a bit opinionated when it comes to statistics. I think a little subject matter expertise goes a long way in finding predictive features. If you&amp;rsquo;re not personally an expert, then that&amp;rsquo;s fine, just go on social media and try to find one to learn from for a bit. You can use that expertise and learning to explore the data without wasting a lot of time.&lt;/p>
&lt;p>Fortunately, I have personal experience seeing the spam that Jessie receives. There are some patterns that jump out.&lt;/p>
&lt;h2 id="1-subject-line">1. subject line&lt;/h2>
&lt;p>First, I kept seeing e-mails with no subject. Weird.&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/spam/email_subj_length.png#center" alt="Who sends e-mails without a subject line?">&lt;/p>
&lt;p>If I look at the actual distribution of spam v. non-spam e-mails, you can see that about 40% (&lt;code>88 / (88 + 134)&lt;/code>) of the spam has no subject. If we can find a way to identify empty subject e-mails that are spam, we&amp;rsquo;ll remove a big chunk of the spam. Conveniently, only 0.65% (&lt;code>5 / (5 + 764)&lt;/code>) of the non-spam has no subject&lt;sup>&lt;a href="#f4">[4]&lt;/a>&lt;/sup>. This means we can pretty confidently say &amp;ldquo;if the e-mail has no subject, it&amp;rsquo;s spam.&amp;rdquo; Silly spammers, making this too easy, I don&amp;rsquo;t even need to whip out the ✨AI✨ yet.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align: left">&lt;/th>
&lt;th style="text-align: left">not spam&lt;/th>
&lt;th style="text-align: left">is spam&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align: left">has subj&lt;/td>
&lt;td style="text-align: left">764&lt;/td>
&lt;td style="text-align: left">134&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align: left">empty subj&lt;/td>
&lt;td style="text-align: left">5&lt;/td>
&lt;td style="text-align: left">88&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
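&lt;p>The rates for this table can be recomputed directly from its four counts. This is just arithmetic on the numbers shown; the dictionary layout and variable names are my own choice for the sketch.&lt;/p>

```python
# Counts copied from the subject-line table above.
counts = {
    ("has_subject", "ham"): 764,
    ("has_subject", "spam"): 134,
    ("empty_subject", "ham"): 5,
    ("empty_subject", "spam"): 88,
}

total_spam = counts[("has_subject", "spam")] + counts[("empty_subject", "spam")]
total_ham = counts[("has_subject", "ham")] + counts[("empty_subject", "ham")]
total_empty = counts[("empty_subject", "ham")] + counts[("empty_subject", "spam")]

# Share of each class that arrives without a subject line.
spam_without_subject = counts[("empty_subject", "spam")] / total_spam
ham_without_subject = counts[("empty_subject", "ham")] / total_ham
# Share of empty-subject e-mails that turn out to be spam.
empty_subject_is_spam = counts[("empty_subject", "spam")] / total_empty

print(f"spam with no subject: {spam_without_subject:.1%}")
print(f"ham with no subject: {ham_without_subject:.2%}")
print(f"empty-subject e-mails that are spam: {empty_subject_is_spam:.1%}")
```

&lt;p>The last number is the one that backs the rule of thumb: among e-mails with an empty subject, the overwhelming majority are spam.&lt;/p>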
&lt;h2 id="2-from-address">2. from address&lt;/h2>
&lt;p>Second, I notice that most of the spam comes from new gmail.com addresses. Using the same lens as the empty subject exploration, 95% of the spam comes from an e-mail address ending in &lt;code>@gmail.com&lt;/code>. If we can find a way to identify Gmail addresses that are spammers, we&amp;rsquo;ll remove 95% of the spam. Unfortunately, while most spam is from Gmail, many Gmail e-mails (32%, or 97 of the 97 + 210 Gmail e-mails) are decidedly NOT spam. Many of these are e-mails from me to Jessie, so I hope she doesn&amp;rsquo;t think I&amp;rsquo;m spam!&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align: left">&lt;/th>
&lt;th style="text-align: left">not spam&lt;/th>
&lt;th style="text-align: left">is spam&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align: left">not Gmail&lt;/td>
&lt;td style="text-align: left">672&lt;/td>
&lt;td style="text-align: left">12&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align: left">is Gmail&lt;/td>
&lt;td style="text-align: left">97&lt;/td>
&lt;td style="text-align: left">210&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="3-bro-sign-up-for-my-course-free-dictionary-included">3. bro sign up for my course (free dictionary included)&lt;/h2>
&lt;p>Third, I vaguely noticed that a lot of these spam e-mails use the same rough vocabulary. Why? Maybe because there&amp;rsquo;s a video out there called &lt;a href="https://rroll.to/qKKhQ0">Cold Calling For Beginners: A Step-by-Step Guide to Book Sales Meetings&lt;/a>, or because there are more people selling cold calling courses than there are cold calls to make? Don&amp;rsquo;t even get me started on how there are 10x more courses on how to get started in profitable real estate flipping than there are actual opportunities to profitably flip a house.&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/spam/cold_calling1901.png#center" alt="Who sends e-mails without a subject line?">&lt;/p>
&lt;p>My personal angst at telemarketers (and now e-mail marketers) aside, the point is that the vocabulary that they use is drab, rehearsed, copy-pasted, automated, etc. It&amp;rsquo;s certainly not ✨AI✨ generated using the state of the art LLMs. I suspect this is because if the e-mail were too good then they would attract a higher proportion of qualified leads and these spammers mostly make money on the &lt;em>UN&lt;/em>-qualified leads (i.e. how the Nigerian prince scam works. I&amp;rsquo;ll probably write about this sometime). Ok back to the topic at hand: if you make a wordcloud of what is in the e-mails I classify as spam, it&amp;rsquo;s immediately clear that there really are a lot of repeats.&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/spam/spam_wordcloud.png#center" alt="want to convert he most sales for store for no commission? just jumble these words">&lt;/p>
&lt;p>It looks like somebody is telling spammers to jumble together a mix of &amp;ldquo;conversion&amp;rdquo;, &amp;ldquo;product&amp;rdquo;, &amp;ldquo;sale&amp;rdquo;, &amp;ldquo;store&amp;rdquo;, &amp;ldquo;traffic&amp;rdquo;, &amp;ldquo;partnership&amp;rdquo;, &amp;ldquo;customer&amp;rdquo;, and &amp;ldquo;result&amp;rdquo;. I really hope that the spammers aren&amp;rsquo;t paying for a course to learn how to do this.&lt;/p>
&lt;h2 id="4-quick-question-for-you-varunhegdecom">4. &amp;ldquo;quick question for you, varunhegde.com&amp;rdquo;&lt;/h2>
&lt;p>I swear ✨AI✨ is either going to make spamming way easier or way way dumber. If you noticed in the wordcloud above, the spammers just love to say &amp;ldquo;jessiereneenyc&amp;rdquo; all over the place. When you read the actual e-mails, it looks something like this:&lt;/p>
&lt;blockquote>
&lt;p>Hi &lt;em>&lt;strong>Jessiereneenyc&lt;/strong>&lt;/em>, Out of curiosity, are most of your sales coming from returning customers, or are you focusing on bringing in new visitors right now?&lt;/p>
&lt;/blockquote>
&lt;p>No normal person writes an e-mail like this, I hope. The data confirms this too:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align: left">&lt;/th>
&lt;th style="text-align: left">not spam&lt;/th>
&lt;th style="text-align: left">is spam&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align: left">baseline&lt;/td>
&lt;td style="text-align: left">611&lt;/td>
&lt;td style="text-align: left">65&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align: left">&amp;ldquo;jessiereneenyc&amp;rdquo;&lt;/td>
&lt;td style="text-align: left">158&lt;/td>
&lt;td style="text-align: left">157&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h1 id="constructing-features">constructing features&lt;/h1>
&lt;p>Ok, after all this, we have 4-ish feature ideas. My gut ranks them like so:&lt;/p>
&lt;ol>
&lt;li>empty subject &amp;ndash;&amp;gt; almost definitely spam&lt;/li>
&lt;li>uses the exact text &amp;ldquo;jessiereneenyc&amp;rdquo; &amp;ndash;&amp;gt; almost definitely spam&lt;/li>
&lt;li>generic vocabulary matching probably implies spam&lt;/li>
&lt;li>from a Gmail address &amp;ndash;&amp;gt; often spam, also encompasses most of the spam&lt;/li>
&lt;/ol>
&lt;p>features 1, 2, and 4 are easy features to build, here&amp;rsquo;s the code:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span> feats[&lt;span style="color:#e6db74">&amp;#39;no_subject&amp;#39;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> (feats[&lt;span style="color:#e6db74">&amp;#39;subject&amp;#39;&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>str&lt;span style="color:#f92672">.&lt;/span>len() &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>)&lt;span style="color:#f92672">.&lt;/span>astype(int)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> feats[&lt;span style="color:#e6db74">&amp;#39;has_jessiereneenyc&amp;#39;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> (feats[&lt;span style="color:#e6db74">&amp;#39;clean_body&amp;#39;&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>str&lt;span style="color:#f92672">.&lt;/span>contains(&lt;span style="color:#e6db74">&amp;#39;jessiereneenyc&amp;#39;&lt;/span>) &lt;span style="color:#f92672">|&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> feats[&lt;span style="color:#e6db74">&amp;#39;subject&amp;#39;&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>str&lt;span style="color:#f92672">.&lt;/span>lower()&lt;span style="color:#f92672">.&lt;/span>str&lt;span style="color:#f92672">.&lt;/span>contains(&lt;span style="color:#e6db74">&amp;#39;jessiereneenyc&amp;#39;&lt;/span>))&lt;span style="color:#f92672">.&lt;/span>astype(int)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> feats[&lt;span style="color:#e6db74">&amp;#39;is_gmail&amp;#39;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> feats[&lt;span style="color:#e6db74">&amp;#39;from&amp;#39;&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>str&lt;span style="color:#f92672">.&lt;/span>contains(&lt;span style="color:#e6db74">&amp;#39;@gmail.com&amp;#39;&lt;/span>)&lt;span style="color:#f92672">.&lt;/span>astype(int)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>feature 3 is a little harder, and we&amp;rsquo;ll rely on proper machine learning techniques to build it. Since we know that the feature should basically map to &amp;ldquo;similarity to spammy e-mail vocabulary&amp;rdquo;, if we can construct a generic model for spammy vocabulary we&amp;rsquo;re in a great spot. Fortunately, some very smart people have developed lots of really impressive models to do exactly this.&lt;/p>
&lt;p>I&amp;rsquo;ll use a technique called &lt;a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">Term Frequency-Inverse Document Frequency&lt;/a> (TF-IDF) Vectorization to transform each e-mail&amp;rsquo;s content into a vector that represents how heavily each word is used in that e-mail, weighted by how unusual the word is across the dataset. The &amp;ldquo;term-frequency&amp;rdquo; (TF) of a word in a document is just the relative frequency of that term in that document. In layman&amp;rsquo;s terms TF for &amp;ldquo;apple&amp;rdquo; is computed by counting all the times &amp;ldquo;apple&amp;rdquo; is in a document and then dividing by the number of words in the document. Inverse document frequency (IDF) measures how rare a word is across all of the documents in the provided dataset. Again using the &amp;ldquo;apple&amp;rdquo; example, divide the total number of documents by the number of documents that &amp;ldquo;apple&amp;rdquo; shows up in at least once, then take the log of that number. TF-IDF then is just the TF x IDF for a specific word.&lt;/p>
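&lt;p>To make the arithmetic concrete, here is a small standard-library sketch of the textbook TF and IDF definitions described above. The three toy &amp;ldquo;e-mails&amp;rdquo; and the function names are invented for illustration.&lt;/p>

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Relative frequency of `term` within one document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """log(total documents / documents containing `term`).

    Assumes `term` appears in at least one document of `corpus`.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


# Three toy "e-mails", already split into words.
corpus = [
    "boost store traffic and boost sales".split(),
    "quick question about store traffic".split(),
    "dinner at eight tonight".split(),
]

# Score a few words against the first e-mail.
for term in ("boost", "store", "dinner"):
    score = tf(term, corpus[0]) * idf(term, corpus)
    print(term, round(score, 3))
```

&lt;p>&amp;ldquo;boost&amp;rdquo; scores highest because it is both frequent in the first e-mail and absent from the others, while &amp;ldquo;dinner&amp;rdquo; scores zero because it never appears there. Note that scikit-learn&amp;rsquo;s &lt;code>TfidfVectorizer&lt;/code> by default uses a smoothed IDF and normalises each row, so its numbers will not match this plain version exactly.&lt;/p>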
&lt;p>All of this math seems onerous, but again, smarter minds have done the hard work for me. scikit-learn in Python has an implementation called &lt;code>TfidfVectorizer&lt;/code> that will generate a matrix of TF-IDF terms for a specific set of documents. One line of code!&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span> vectorizer &lt;span style="color:#f92672">=&lt;/span> TfidfVectorizer(max_features&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> tfidf_feats &lt;span style="color:#f92672">=&lt;/span> vectorizer&lt;span style="color:#f92672">.&lt;/span>fit_transform(feats[&lt;span style="color:#e6db74">&amp;#39;doc&amp;#39;&lt;/span>])
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h1 id="training-and-testing-the-model">training and testing the model&lt;/h1>
&lt;p>I now have some heuristic understanding, some labelled data, basic features, some fancier features. I can finally train a model!&lt;/p>
&lt;p>WRONG, I have to pick which model to try first. There are so many to choose from, but if you follow the handy dandy guide that scikit-learn provides you don&amp;rsquo;t even need to know the pros and cons of the various ones!&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/spam/ml_map.png#center" alt="wow flow charts amazing">&lt;/p>
&lt;p>If you follow the flow chart:&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> I have more than 50 samples (N=~900, maybe I&amp;rsquo;ll pull in more e-mails if I get bored)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> I am predicting a category (&lt;code>spam&lt;/code> or &lt;code>ham&lt;/code>)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> I do have labeled data&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> I have less than 100k samples (no SGD or kernel approximation for me)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> the data includes text-based features&lt;/li>
&lt;/ul>
&lt;p>What that leaves behind is Naive Bayes classifiers. I&amp;rsquo;ll use a Multinomial Naive Bayes model because the features are either binary flags (my special features) or TF-IDF weights, and both effectively behave like counts or frequencies.&lt;/p>
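&lt;p>For intuition about what the model does under the hood, here is a tiny Multinomial Naive Bayes written with only the standard library. It is a teaching sketch that works on raw word counts with add-one smoothing, not scikit-learn&amp;rsquo;s implementation; the class name and the toy training e-mails are invented.&lt;/p>

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Multinomial Naive Bayes over word counts with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            # Prior: share of training documents that belong to class c.
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(tok for d in class_docs for tok in d)
            total = sum(counts.values()) + len(self.vocab)
            # Likelihood: smoothed share of class c's words that are `tok`.
            self.log_likelihood[c] = {
                tok: math.log((counts[tok] + 1) / total) for tok in self.vocab
            }
        return self

    def predict(self, doc):
        scores = {}
        for c in self.classes:
            # Words never seen in training are skipped.
            scores[c] = self.log_prior[c] + sum(
                self.log_likelihood[c][tok] for tok in doc if tok in self.vocab
            )
        return max(scores, key=scores.get)


train_docs = [
    "boost store traffic sales".split(),
    "partnership boost conversion sales".split(),
    "dinner tonight at eight".split(),
    "see you at dinner".split(),
]
train_labels = ["spam", "spam", "ham", "ham"]

toy_model = TinyMultinomialNB().fit(train_docs, train_labels)
print(toy_model.predict("boost sales for your store".split()))
print(toy_model.predict("dinner at eight".split()))
```

&lt;p>The &amp;ldquo;naive&amp;rdquo; part is the sum inside &lt;code>predict&lt;/code>: each word contributes independently to the score, which is what keeps the model cheap to train on a few hundred e-mails.&lt;/p>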
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span> other_feats &lt;span style="color:#f92672">=&lt;/span> [&lt;span style="color:#e6db74">&amp;#39;no_subject&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;has_jessiereneenyc&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;is_gmail&amp;#39;&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> len(other_feats):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>     X &lt;span style="color:#f92672">=&lt;/span> sp&lt;span style="color:#f92672">.&lt;/span>hstack([tfidf_feats, sp&lt;span style="color:#f92672">.&lt;/span>csr_array(feats[other_feats])])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">else&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>     X &lt;span style="color:#f92672">=&lt;/span> tfidf_feats
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> y &lt;span style="color:#f92672">=&lt;/span> feats[&lt;span style="color:#e6db74">&amp;#39;is_spam&amp;#39;&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>astype(int)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># split only after X and y have been built&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> X_train, X_test, y_train, y_test &lt;span style="color:#f92672">=&lt;/span> train_test_split(X, y, test_size&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">0.2&lt;/span>, random_state&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">42&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model &lt;span style="color:#f92672">=&lt;/span> MultinomialNB()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model&lt;span style="color:#f92672">.&lt;/span>fit(X_train, y_train)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> feats[&lt;span style="color:#e6db74">&amp;#39;is_spam_pred&amp;#39;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>Series(model&lt;span style="color:#f92672">.&lt;/span>predict(X), index&lt;span style="color:#f92672">=&lt;/span>df&lt;span style="color:#f92672">.&lt;/span>index)&lt;span style="color:#f92672">.&lt;/span>astype(bool)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> y_pred &lt;span style="color:#f92672">=&lt;/span> feats&lt;span style="color:#f92672">.&lt;/span>loc[y_test&lt;span style="color:#f92672">.&lt;/span>index, &lt;span style="color:#e6db74">&amp;#39;is_spam_pred&amp;#39;&lt;/span>]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h1 id="results">results&lt;/h1>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th colspan="1">&lt;/th>
&lt;th colspan="2">Training&lt;/th>
&lt;th colspan="2">Testing&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Not Spam&lt;/th>
&lt;th>Is Spam&lt;/th>
&lt;th>Not Spam&lt;/th>
&lt;th>Is Spam&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Not Predicted&lt;/td>
&lt;td>599&lt;/td>
&lt;td>18&lt;/td>
&lt;td>149&lt;/td>
&lt;td>3&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Predicted Spam&lt;/td>
&lt;td>7&lt;/td>
&lt;td>168&lt;/td>
&lt;td>2&lt;/td>
&lt;td>45&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Accuracy&lt;/td>
&lt;td colspan="2">96.8%&lt;/td>
&lt;td colspan="2">97.5%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Recall&lt;/td>
&lt;td colspan="2">90.3%&lt;/td>
&lt;td colspan="2">93.8%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>F1 Score&lt;/td>
&lt;td colspan="2">93.1%&lt;/td>
&lt;td colspan="2">94.7%&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>I trained the model on 80% of the data and then tested against the remaining 20%. Unlike in the hedge fund world, I don&amp;rsquo;t need to worry too much about forward looking bias: the spammers are hopefully not changing their methods over time to extract alpha from the spam market.&lt;/p>
&lt;p>The results speak for themselves:&lt;/p>
&lt;ul>
&lt;li>I can correctly guess spam v. ham 97.5% of the time OUT OF SAMPLE&lt;/li>
&lt;li>I mistakenly classify 6% of true spam as ham: that&amp;rsquo;s not so bad&lt;/li>
&lt;li>I mistakenly classify 1.3% of true ham as spam.&lt;/li>
&lt;/ul>
&lt;p>Importantly, the false positive rate is pretty low at about 1%. Do you think a person that has to sort through 200 e-mails will miss more than 2 e-mails because they have to wade through 50 spam e-mails?&lt;/p>
&lt;p>Now, that same person only has to review 148 e-mails and they can be 98% confident that they&amp;rsquo;re not missing anything important.&lt;/p>
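&lt;p>All of these rates fall out of the four test-set counts in the results table. A short sketch to recompute them; the function name and the returned dictionary keys are my own, the counts are copied from the table.&lt;/p>

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of everything flagged as spam, how much was spam
    recall = tp / (tp + fn)     # of all true spam, how much was flagged
    f1 = 2 * precision * recall / (precision + recall)
    false_positive_rate = fp / (fp + tn)  # true ham wrongly flagged as spam
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
    }


# Test-set counts: 45 spam caught, 2 ham flagged by mistake,
# 3 spam missed, 149 ham left alone.
test_metrics = classification_metrics(tp=45, fp=2, fn=3, tn=149)
for name, value in test_metrics.items():
    print(f"{name}: {value:.1%}")
```

&lt;p>With these counts, the 93.8% figure in the table corresponds to &lt;code>tp / (tp + fn)&lt;/code>, and one minus that value is the roughly 6% of true spam that slips through as ham.&lt;/p>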
&lt;h1 id="running-in-production">running in production?&lt;/h1>
&lt;p>My rinky dink spam detection model, trained on a random sample of 992 e-mails, does a relatively phenomenal job for a short Friday evening hacknight. It probably took me longer to write this post than it did to do the analysis.&lt;/p>
&lt;p>Now all I need to do is scrape together a service that iteratively moves high probability spam e-mails into a &amp;ldquo;likely-spam&amp;rdquo; folder and then continues to retrain over time with new data.&lt;/p>
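&lt;p>A rough sketch of the &amp;ldquo;move&amp;rdquo; half of that service. It assumes a connection object that behaves like Python&amp;rsquo;s &lt;code>imaplib.IMAP4&lt;/code> with a mailbox already selected, and a dictionary of spam probabilities per message id (for instance from the model&amp;rsquo;s &lt;code>predict_proba&lt;/code>). The function name, the 0.9 threshold, and the &lt;code>FakeIMAP&lt;/code> stand-in are hypothetical; retraining is not shown.&lt;/p>

```python
def move_likely_spam(client, scored_ids, threshold=0.9, folder="likely-spam"):
    """Copy messages scoring at or above `threshold` into `folder`, then delete the originals."""
    moved = []
    for msg_id, prob in scored_ids.items():
        if prob >= threshold:
            typ, _ = client.copy(msg_id, folder)
            if typ == "OK":
                # Only flag the original for deletion once the copy succeeded.
                client.store(msg_id, "+FLAGS", "\\Deleted")
                moved.append(msg_id)
    if moved:
        # Note: expunge removes every message flagged \Deleted in the mailbox.
        client.expunge()
    return moved


class FakeIMAP:
    """Stand-in for a real connection so the sketch runs without a mail server."""

    def __init__(self):
        self.calls = []

    def copy(self, message_set, new_mailbox):
        self.calls.append(("copy", message_set, new_mailbox))
        return "OK", [b""]

    def store(self, message_set, command, flags):
        self.calls.append(("store", message_set, command, flags))
        return "OK", [b""]

    def expunge(self):
        self.calls.append(("expunge",))
        return "OK", [b""]


fake = FakeIMAP()
moved = move_likely_spam(fake, {"1": 0.97, "2": 0.12, "3": 0.91})
print(moved)
```

&lt;p>Keeping the threshold high trades a little missed spam for fewer real e-mails disappearing into the folder, which matters more here than catching every last cold pitch.&lt;/p>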
&lt;h1 id="footnotes">footnotes&lt;/h1>
&lt;p>&lt;span id="f1">[1]&lt;/span> - I&amp;rsquo;m actually censoring the e-mail here so that they don&amp;rsquo;t find out I&amp;rsquo;m making fun of them in case they&amp;rsquo;re malicious. Ironically, it also prevents them from receiving spam from scraper bots in return.&lt;/p>
&lt;p>&lt;span id="f2">[2]&lt;/span> - Jessie, if you&amp;rsquo;re reading this, I know you plan to get Gmail set up as soon as it makes sense :)&lt;/p>
&lt;p>&lt;span id="f3">[3]&lt;/span> - this is such a fun word. so domain specific but it does make me feel real fancy to use it. I knew the SAT would come in handy some day&lt;/p>
&lt;p>&lt;span id="f4">[4]&lt;/span> - If I actually go to check those e-mails they actually are spam too, I&amp;rsquo;m just too lazy to reclassify them.&lt;/p></description></item><item><title>The dark web 101: what it is, how it works, and why people use it</title><link>https://varunhegde.com/the-dark-web-101-what-it-is-how-it-works-and-why-people-use-it/</link><pubDate>Fri, 07 Nov 2025 21:37:09 -0500</pubDate><author>varun@varunhegde.com (Varun Hegde)</author><guid>https://varunhegde.com/the-dark-web-101-what-it-is-how-it-works-and-why-people-use-it/</guid><description>&lt;p>Interesting fun fact, the internet only connects about 5.5 billion people as of 2024. Not surprisingly, most of these people live in developed countries as you can see in the animation below from the &lt;a href="https://web.archive.org/web/20170915113829/http://data.un.org/Data.aspx?d=ITU&amp;amp;f=ind1Code%3aI99H">United Nations&lt;/a>.&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/onions/Percentage_of_individuals_using_the_Internet_1990-2014.gif#center" alt="Percentage of Individuals Using the Internet">&lt;/p>
&lt;p>There are infinite ways to classify the internet, but I think that the most &lt;em>mysterious&lt;/em> one is the breakdown of &lt;em>clearnet&lt;/em>, &lt;em>deepweb&lt;/em>, and &lt;em>darkweb&lt;/em>. Most people that are aware of the distinction between those three terms don&amp;rsquo;t get too hot and bothered about the darkweb or the deepweb, but to the layman, the darkweb sounds like a scary place where you can only find spooky and unsavory things. I&amp;rsquo;ve always had a fascination with tech so I thought I&amp;rsquo;d do some research myself to understand things a little better.&lt;/p>
&lt;p>Unfortunately, to truly understand the dark web you also kind of need to understand the basics of what makes the regular old internet tick. If you&amp;rsquo;re already familiar with the basics then you can skip ahead to &lt;a href="#some-definitions">Some Definitions&lt;/a>.&lt;/p>
&lt;h2 id="the-modern-internet">The Modern Internet&lt;/h2>
&lt;h3 id="a-flash-of-light-as-the-birth-of-the-web">A Flash of Light as the Birth of the Web&lt;/h3>
&lt;p>Before the mushroom clouds in the Oppenheimer movie were a twinkle in Christopher Nolan’s eye, there was the real Oppenheimer. He&amp;rsquo;s famous for his quotes, one of my favorites being: &amp;ldquo;We knew the world would not be the same. A few people laughed, a few people cried.&amp;rdquo; No one at the time could have imagined that the nuclear bomb would spark a mad race to prevent further destruction which resulted in the internet.&lt;/p>
&lt;p>In the 1960s, the United States Air Force was trying to improve its ability to ensure mutually assured destruction (MAD). My seventh-grade history teacher loved that term for all the right reasons. MAD was the big idea to prevent future nuclear conflict. It all boiled down to the most basic of human fears: if you nuke us, we nuke you. Simple game theory says that the stable equilibrium is no nuclear launches as long as both sides can still communicate and coordinate a retaliatory strike.&lt;/p>
&lt;p>To guarantee that communication, the military needed systems that could survive jamming, attacks, and partial blackouts. These systems needed to send messages over long distances without relying on a direct connection.&lt;/p>
&lt;p>Enter packet switching: the foundation of the modern internet.&lt;/p>
&lt;h3 id="packet-switching-arpanet-and-tcpip">Packet Switching, ARPANET, and TCP/IP&lt;/h3>
&lt;p>The core issue with point-to-point communication systems is that we want to send arbitrarily long messages over an unreliable messaging medium (all of them are, to some extent). For the sake of argument, let&amp;rsquo;s say that you&amp;rsquo;re trying to send a message cross country. It&amp;rsquo;s critically important that the message arrive and you don&amp;rsquo;t want to take any chances that your recipient misses even a piece.&lt;/p>
&lt;p>You could send messages over radio in their entirety, or as encoded/encrypted messages over Morse code or telegram, but intermittent connections mean data loss.&lt;/p>
&lt;p>Packet switching is the idea that you can chunk messages into fixed &lt;em>&lt;strong>packets&lt;/strong>&lt;/em> of information that are transmitted over an arbitrary network medium. Each packet need only contain enough information to identify the start of the message, the destination address, the sender, and an id for ordering the pieces, followed by the piece of the message itself and lastly a little marker to identify the end of the message. This way, instead of needing to be able to send the whole message from point A to point B, you can reliably transport most of it in pieces and just have the recipient re-request any missing pieces. You can see a sample of a data packet here:&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/onions/packet.jpg#center" alt="Sample Packet">&lt;/p>
&lt;p>If you break a message into 100 packets and your network drops 10%, you just resend the missing packets until the recipient has all of them.
All you need is a network of computers willing to relay packets. The first such network was ARPANET, created by DARPA. ARPANET didn’t initially support lost packet recovery - if the network failed to deliver, the sender and receiver simply got stuck waiting.&lt;/p>
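&lt;p>The chunk, drop, and resend loop described above can be simulated in a few lines. This is a toy model, not how any real network stack works: the packet layout, the 8-character packet size, the 10% drop rate, and the function names are all assumptions picked for the example.&lt;/p>

```python
import random


def to_packets(message, size):
    """Split `message` into (sequence_number, total_packets, chunk) tuples."""
    chunks = [message[i:i + size] for i in range(0, len(message), size)]
    return [(seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]


def lossy_send(packets, drop_rate, rng):
    """Deliver each packet independently with probability 1 - drop_rate."""
    return [p for p in packets if rng.random() >= drop_rate]


def transfer(message, size=8, drop_rate=0.1, seed=1):
    """Keep resending whatever is still missing until every packet has arrived."""
    rng = random.Random(seed)
    packets = to_packets(message, size)
    received = {}
    rounds = 0
    while len(received) != len(packets):
        missing = [p for p in packets if p[0] not in received]
        for seq, _, chunk in lossy_send(missing, drop_rate, rng):
            received[seq] = chunk
        rounds += 1
    # Reassemble by sequence number, regardless of arrival order.
    return "".join(received[seq] for seq in sorted(received)), rounds


message = "if you nuke us, we nuke you; please confirm receipt of this message"
result, rounds = transfer(message)
print(result == message, rounds)
```

&lt;p>The sequence number is what makes this work: the receiver can tell exactly which pieces are missing and rebuild the message in order even when packets arrive over several rounds.&lt;/p>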
&lt;p>To fix this, early developers created a protocol that sits on top of the network called Transmission Control Protocol/Internet Protocol (TCP/IP). This protocol had roughly 4 ground rules:&lt;/p>
&lt;ol>
&lt;li>Each distinct sub-network in the broader network could/should stand on its own. No overarching network could impose requirements for other networks to connect.&lt;/li>
&lt;li>Communications (i.e. packets) would be forwarded on a best efforts basis. If a packet didn&amp;rsquo;t make it to the final destination, it would be up to the packet source to retransmit.&lt;/li>
&lt;li>Intermediate connection points would store no information about the individual flows of packets to keep them simpler and low latency&lt;/li>
&lt;li>No global authority would/could own the network&lt;/li>
&lt;/ol>
&lt;p>They also needed to handle the following issues:&lt;/p>
&lt;ul>
&lt;li>Algorithms to prevent lost packets from permanently disabling communications and enabling them to be successfully retransmitted from the source.&lt;/li>
&lt;li>Providing for host-to-host “pipelining” so that multiple packets could be enroute from source to destination at the discretion of the participating hosts, if the intermediate networks allowed it.&lt;/li>
&lt;li>Gateway functions to allow it to forward packets appropriately. This included interpreting IP headers for routing, handling interfaces, breaking packets into smaller pieces if necessary, etc.&lt;/li>
&lt;li>The need for end-end checksums, reassembly of packets from fragments and detection of duplicates, if any.&lt;/li>
&lt;li>The need for global addressing&lt;/li>
&lt;li>Techniques for host-to-host flow control.&lt;/li>
&lt;li>Interfacing with the various operating systems&lt;/li>
&lt;li>There were also other concerns, such as implementation efficiency, internetwork performance, but these were secondary considerations at first.&lt;/li>
&lt;/ul>
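&lt;p>A few of the items above (end-to-end checksums, reassembly, and duplicate detection) fit in a short Python sketch. This is only an illustration under my own assumptions: the packet is a plain dictionary with made-up field names, and CRC32 from the standard library stands in for the checksum that real TCP uses.&lt;/p>

```python
import zlib


def make_packet(seq: int, total: int, payload: bytes) -> dict:
    """Build a toy packet (hypothetical layout): position, packet count,
    payload, and a CRC32 checksum of the payload."""
    return {"seq": seq, "total": total, "payload": payload,
            "crc": zlib.crc32(payload)}


def reassemble(packets: list) -> tuple:
    """Return (message, []) when every piece arrived intact, otherwise
    (None, list_of_missing_sequence_numbers)."""
    good = {}
    total = None
    for packet in packets:
        if zlib.crc32(packet["payload"]) != packet["crc"]:
            continue  # corrupted in transit: treat it as lost
        if packet["seq"] in good:
            continue  # duplicate: keep the first copy only
        good[packet["seq"]] = packet["payload"]
        total = packet["total"]
    if total is None:
        return None, []  # nothing usable arrived
    missing = [seq for seq in range(total) if seq not in good]
    if missing:
        return None, missing  # the sender should retransmit these
    # Arrival order does not matter; sequence numbers fix the order.
    return b"".join(good[seq] for seq in range(total)), []
```

&lt;p>The receiver never needs the packets to arrive in order or exactly once. It only needs enough information in each packet to discard the bad ones and ask for the rest.&lt;/p>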
&lt;h3 id="httphttps">HTTP/HTTPS&lt;/h3>
&lt;p>There’s a fundamental issue with sending TCP/IP packets: packet contents are &lt;em>&lt;strong>cleartext&lt;/strong>&lt;/em>. Anyone along the route can read them, like passing a note to your crush through a classroom and every kid opens it on the way.&lt;/p>
&lt;p>What we need is some way for the messages to be hidden from prying eyes while the correct recipient, and nobody else, can still read them. Enter encryption. Cryptography is a field that is far too deep to even start with a 101 in this post, but maybe I&amp;rsquo;ll get into it another time. For now, it suffices to say that there is a protocol called Hypertext Transfer Protocol (HTTP) which sends data between two communicating services, and that traffic can be encrypted using Transport Layer Security (TLS). This encrypted variant of HTTP is called HTTPS (HTTP Secure).&lt;/p>
&lt;h3 id="summary">Summary&lt;/h3>
&lt;p>The modern internet is a stack of different protocols:&lt;/p>
&lt;ol>
&lt;li>Physical - the actual wires connecting computers&lt;/li>
&lt;li>Data link - allows sending &amp;ldquo;frames&amp;rdquo; of data&lt;/li>
&lt;li>Network - layer that coordinates sending &lt;em>packets&lt;/em> across a network, including addressing, routing, and traffic control&lt;/li>
&lt;li>Transport (TCP, UDP)&lt;/li>
&lt;li>Session - managing continuous streams of data between two clients&lt;/li>
&lt;li>Presentation - encoding, compression, encryption/decryption&lt;/li>
&lt;li>Application (HTTP, HTTPS, FTP, FTPS, SFTP, SMTP) - high level protocols that coordinate how to actually exchange data&lt;/li>
&lt;/ol>
&lt;p>These layers enable encrypted communication over unreliable networks. Fleshing out the summary here into more detail would be a literal textbook, so I won&amp;rsquo;t go any deeper on ARPANET, TCP/IP, HTTP, SSL, or TLS. If you&amp;rsquo;re interested you can follow &lt;a href="https://www.internetsociety.org/internet/history-internet/brief-history-internet/">A Brief History of the Internet&lt;/a> and online resources like the &lt;a href="https://en.wikipedia.org/wiki/OSI_model">OSI Model&lt;/a> overview on Wikipedia that describe the modern web stack. Those articles have much better technical writers than I and include more detail than I&amp;rsquo;ll ever be able to address.&lt;/p>
&lt;p>With the basics covered, we can define the terms from the introduction.&lt;/p>
&lt;h2 id="some-definitions">Some Definitions&lt;/h2>
&lt;ul>
&lt;li>clearnet, surface web - the shallowest layer of the internet. Sites that are accessible by the broad public with limited anonymity or barriers to entry. For example, public Twitter, the NYTimes, Reddit, etc. You can confidently send messages to and receive messages from clearnet sites, but you can&amp;rsquo;t guarantee that the communication is encrypted or anonymous. Snoops can figure out who the communicating parties are pretty easily by monitoring packet traffic.&lt;/li>
&lt;li>deep web - hidden and unindexed websites. You can&amp;rsquo;t just find these by searching Google or you may need to log in to even know they exist. This includes banking websites and pages hidden behind paywalls&lt;/li>
&lt;li>dark web - web sites and services which are unindexed, and inaccessible except by special encryption schemes that statistically guarantee the user and service&amp;rsquo;s anonymity and security against the most basic attack vectors.&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://varunhegde.com/images/onions/Iceberg_of_Webs.svg.png#center" alt="Iceberg of Webs">&lt;/p>
&lt;h2 id="who-needs-the-darkweb">Who Needs the Darkweb?&lt;/h2>
&lt;p>Even though the modern internet stack using HTTPS hides message contents, it does not hide &lt;em>&lt;strong>WHO&lt;/strong>&lt;/em> is communicating. For most people, this is no big deal. But it could literally be life or death for:&lt;/p>
&lt;ol>
&lt;li>Political dissidents: try being a Russian critic of Putin or a Chinese critic of the CCP. These folks could use a way to communicate and share information without anybody knowing that they&amp;rsquo;re communicating&lt;/li>
&lt;li>Journalists: need to be able to communicate with sources securely (provided by HTTPS) &lt;em>&lt;strong>AND ANONYMOUSLY&lt;/strong>&lt;/em>&lt;/li>
&lt;li>Whistleblowers: they&amp;rsquo;ll need to communicate anonymously with tip off points&lt;/li>
&lt;li>Criminals: probably obvious why they would want to communicate anonymously&lt;/li>
&lt;li>People looking to prevent digital advertisers from tracking them&lt;/li>
&lt;li>People that want to support better privacy for all internet users&lt;/li>
&lt;/ol>
&lt;p>The &lt;em>&lt;strong>dark web&lt;/strong>&lt;/em> solves this problem by introducing a statistically anonymous communication protocol.&lt;/p>
&lt;h2 id="the-onion-router">The Onion Router&lt;/h2>
&lt;p>If you fall into any of the above categories, you might want to use onion routing.&lt;/p>
&lt;p>The protocol is pretty simple. Instead of sending messages directly from sender to recipient, packets are routed in a secure layer from sender to intermediary 1, then intermediary 2, then to intermediary 3 and finally to the recipient.&lt;/p>
&lt;p>To protect the anonymity of the sender and recipient, the original message is wrapped in multiple layers of encrypted envelopes. If intermediaries pass enough messages, there is a statistically low probability that somebody watching messages passed through the network could back out the original sender and intended recipient. That snoop would need to be watching a large portion of the network simultaneously.&lt;/p>
&lt;p>Palo Alto Networks has a pretty good &lt;a href="https://unit42.paloaltonetworks.com/tor-traffic-enterprise-networks/">overview of TOR&lt;/a>. The procedure is roughly as follows:&lt;/p>
&lt;ol>
&lt;li>A user builds a TOR circuit by selecting three nodes to relay messages and obtains a shared public encryption key from each of them. Let&amp;rsquo;s call them &lt;code>N1&lt;/code>, &lt;code>N2&lt;/code>, and &lt;code>N3&lt;/code>&lt;/li>
&lt;li>The user encrypts the private message that she wants to send in three layers. Let&amp;rsquo;s say encryption of a message M with public key &lt;code>i&lt;/code> is denoted by &lt;code>fi(M)&lt;/code>. The user sends a message M1 that looks like &lt;code>f1(f2(f3(M)))&lt;/code> to relay node N1.&lt;/li>
&lt;li>Node 1 decrypts the outer layer of M1 to get the inner message M2 = &lt;code>f2(f3(M))&lt;/code>. It reads that the next node in sequence is N2 and forwards M2 to N2.&lt;/li>
&lt;li>Node 2 decrypts the outer layer of M2 to get the inner message M3 = &lt;code>f3(M)&lt;/code>. It reads that the next node in sequence is N3 and forwards M3 to N3.&lt;/li>
&lt;li>Node 3 decrypts the last layer of M3 to get M. It reads that the next stop is the intended recipient (R) and forwards M to R.&lt;/li>
&lt;li>The recipient receives the message and reads it. The recipient now has the option to reply, and each relay node can rewrap the reply at each step of the way to send it back.&lt;/li>
&lt;/ol>
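&lt;p>The wrap-and-peel steps above can be sketched in a few lines of Python. Big caveat: this is a toy under stated assumptions, not Tor&amp;rsquo;s actual cryptography. Each relay is assumed to share a secret key with the sender, the cipher is a throwaway XOR keystream built from SHA-256 that you should never use for real secrets, and the names &lt;code>wrap&lt;/code> and &lt;code>peel&lt;/code> are made up for this example.&lt;/p>

```python
import hashlib
import json


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher (NOT secure): XOR data with a SHA-256-derived
    keystream. Applying it twice with the same key undoes it."""
    stream = bytearray()
    counter = 0
    while len(data) > len(stream):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def wrap(route: list, keys: dict, recipient: str, message: bytes) -> bytes:
    """Build the onion from the inside out: the innermost layer is for
    the last relay, the outermost layer is for the first relay."""
    next_hops = route[1:] + [recipient]
    onion = message
    for node, next_hop in reversed(list(zip(route, next_hops))):
        layer = json.dumps({"next": next_hop, "payload": onion.hex()})
        onion = keystream_xor(keys[node], layer.encode())
    return onion


def peel(key: bytes, onion: bytes) -> tuple:
    """What a single relay does: remove one layer, learn only the
    next hop, and get an opaque blob to forward."""
    layer = json.loads(keystream_xor(key, onion).decode())
    return layer["next"], bytes.fromhex(layer["payload"])
```

&lt;p>With keys for &lt;code>N1&lt;/code>, &lt;code>N2&lt;/code>, and &lt;code>N3&lt;/code>, calling &lt;code>peel&lt;/code> three times in order yields the hops N2, N3, and finally R along with the original message. Notice that each relay only ever sees the name of the next hop, which is the whole trick.&lt;/p>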
&lt;p>What&amp;rsquo;s great about this is that at each step, the sender only knows the alias for the service, and each relay only knows about their immediate neighbors in the message. Even better, the receiver knows nothing about the sender except the contents of the message, and the sender knows nothing about the receiver except the alias of the service and the responses that they send.&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/onions/who_knows_what.png#center" alt="who knows what in a TOR relay">&lt;/p>
&lt;p>Let&amp;rsquo;s assume now that there is some globally omniscient observer that can view every connection and every message in and out of every network node.&lt;/p>
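&lt;p>If the communicators use TOR, then it becomes much harder for said observer to correlate the first wrapped message with the final message sent to the receiver, because the message looks different at every hop. An observer who controls all of the relay nodes, or who can match the timing and volume of traffic entering and leaving the network, can still link the two ends.&lt;/p>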
&lt;p>&lt;img src="https://varunhegde.com/images/onions/global-observer.png#center" alt="global observers">&lt;/p>
&lt;h2 id="operational-security-opsec-considerations">Operational Security (OpSec) Considerations&lt;/h2>
&lt;p>Before we start, you might ask: why do I need this? Presumably you want to avoid letting bad actors (read: Facebook? OpenAI? Hostile governments?) surveil you. You also want to avoid exposing yourself to attacks and theft.&lt;/p>
&lt;p>There are people that write multiple textbooks on the topic of opsec. I can&amp;rsquo;t possibly cover enough to save you from yourself, and frankly I bet I don&amp;rsquo;t know enough to protect myself 100% (does anybody really?). With that caveat in mind, read the posts below to maintain your safety when browsing the dark web. As it turns out, these tips are also applicable to the regular web.&lt;/p>
&lt;p>You should start with reading the basics on this &lt;a href="https://gist.github.com/vil/7dfdb362d3aef91183101c300da3c543">GitHub gist&lt;/a> and &lt;a href="https://www.reddit.com/r/TOR/comments/3dq1pg/here_is_a_quick_guide_to_using_tor_opsec/">this Reddit post&lt;/a>. Obviously, there are levels to this, and the only people that truly need to go to the furthest extremes are probably sending messages via carrier pigeon.&lt;/p>
&lt;h3 id="common-mistakes-that-put-you-at-risk">Common mistakes that put you at risk&lt;/h3>
&lt;ul>
&lt;li>ID re-use: reusing the same usernames, e-mail addresses or passwords across different services and platforms can make you more susceptible to risk. If one service gets hacked or has bad security practices, your sensitive access credentials are exposed more broadly&lt;/li>
&lt;li>Weak authentication: simple passwords or lack of multi-factor authentication (MFA) can leave you open to simple brute force attacks or other exploits. I won&amp;rsquo;t get into how modern password security is implemented - you can do some digging on that.&lt;/li>
&lt;li>Sharing too much: just like saying too much makes you a target, revealing too much info online can do the same. Real g&amp;rsquo;s move in silence like lasagna.&lt;/li>
&lt;li>Predictable routines: go to the same website every day at the same time? Anybody snooping your traffic might be able to identify you by using that information&lt;/li>
&lt;/ul>
&lt;h2 id="frequently-asked-questions">Frequently Asked Questions&lt;/h2>
&lt;p>I told my friends that I&amp;rsquo;m writing something about this and heard a suite of questions&amp;hellip; some better than others.&lt;/p>
&lt;h3 id="1-is-the-dark-web-illegal">1. Is the dark web illegal?&lt;/h3>
&lt;p>I cannot emphasize this enough: &lt;em>&lt;strong>this is not legal advice and is merely my personal, non-professional opinion&lt;/strong>&lt;/em>. That said, the legality of the dark web depends on jurisdiction. If you live in the United States and are not otherwise committing a crime, the law as of the time of this posting suggests that you are not committing any crimes.&lt;/p>
&lt;p>In other countries, this is not necessarily the case. It&amp;rsquo;s possible that using TOR services can put you at significant personal risk. For example, I can imagine certain East Asian countries might not be too thrilled.&lt;/p>
&lt;h3 id="2-can-the-police-track-tor-usage">2. Can the police track TOR usage?&lt;/h3>
&lt;p>I cannot emphasize this enough: &lt;em>&lt;strong>this is not legal advice and is merely my personal, non-professional opinion&lt;/strong>&lt;/em>. In theory, it is absolutely possible for police to track TOR usage. In practice, this &lt;em>&lt;strong>should&lt;/strong>&lt;/em> be exceedingly difficult because anonymous and uncompromised relay hosts probably aren&amp;rsquo;t actively sharing your messages with law enforcement. If you&amp;rsquo;re worried that the police are using their limited resources to track your TOR usage, it&amp;rsquo;s probably more likely that they&amp;rsquo;re using higher quality and lower tech exploits (ever heard of wire-tapping warrants?).&lt;/p>
&lt;h3 id="3-whats-the-difference-between-tor-and-vpns">3. What&amp;rsquo;s the difference between Tor and VPNs?&lt;/h3>
&lt;p>A Virtual Private Network (VPN) acts as an overlay to pass messages between participants. The VPN generally has a known set of relays or at least a set of relays that are all controlled by the same provider. As such, compromising the intermediary means compromising your privacy.&lt;/p>
&lt;p>TOR is a protocol that passes messages between relay circuit members in a big game of telephone. The relay strategy allows roughly anonymous communication between parties without requiring a dedicated and trusted third party to act as the relay.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>One of my favorite authors, Douglas Adams, says, &amp;ldquo;Space is big. You just won&amp;rsquo;t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it&amp;rsquo;s a long way down the road to the chemist&amp;rsquo;s, but that&amp;rsquo;s just peanuts to space.&amp;rdquo;&lt;/p>
&lt;p>Even if I&amp;rsquo;m not as entertaining as the Hitchhiker&amp;rsquo;s Guide to the Galaxy, I can at least say that the internet is pretty big too. Most of us are exposed to the internet via public services, but that&amp;rsquo;s just the tip of the iceberg. There is just so much data hidden below the surface, just out of reach, and hidden deep so nobody can find it without knowing it&amp;rsquo;s there.&lt;/p>
&lt;p>It might seem dark and mysterious, but at the end of the day these deep and hidden services are just another evolution of the data privacy craze that started millennia ago with basic cyphers and has extended into the data age. I&amp;rsquo;m 100% confident that encryption will continue to evolve in ways that we could never have imagined.&lt;/p>
&lt;p>In the meantime, let me know if there are any topics around the dark web that you&amp;rsquo;d like me to dig into in further detail.&lt;/p></description></item><item><title>What’s a Party Without Potion?</title><link>https://varunhegde.com/whats-a-party-without-potion/</link><pubDate>Mon, 28 Jul 2025 22:00:32 -0400</pubDate><author>varun@varunhegde.com (Varun Hegde)</author><guid>https://varunhegde.com/whats-a-party-without-potion/</guid><description>&lt;p>If you read my previous post about the latest homebrew batch of mead, you&amp;rsquo;ll remember that it was called &lt;a href="https://varunhegde.com/alchemists-gold">Alchemist&amp;rsquo;s Gold&lt;/a> so that it could be drunk at a medieval themed bachelor party. We had a lot of mead in the bucket. Enough, in fact, to make 22 bottles. That&amp;rsquo;s a case and a half for the bachelor party and 6 for me to take to events or drink at home.&lt;/p>
&lt;p>We did a little product photoshoot at home to show the final product in detail.&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/mead2/DSC_7134.JPEG#center" alt="Alchemists Gold">&lt;/p>
&lt;p>I&amp;rsquo;m very proud of how these photos turned out.&lt;/p>
&lt;h2 id="patience-is-a-virtue-that-i-do-not-possess">Patience is a Virtue (that I do not possess)&lt;/h2>
&lt;p>In my previous post, I mentioned that the yeast that I used (&lt;a href="https://www.lallemandbrewing.com/en/united-states/products/lalvin-ec-1118/">EC-1118&lt;/a>) could theoretically achieve very robust fermentation at high ABV. I also mentioned that it wasn&amp;rsquo;t 100% clear that the stabilizers I added were working since there was still additional gas in the airlock. Pasteurization could have possibly helped at the risk of changing up the flavor profile.&lt;/p>
&lt;p>I bottled it up anyways.&lt;/p>
&lt;p>If there was continued fermentation, then I&amp;rsquo;ve just gifted Sam a case full of exploding bottles. On the other hand, if the stabilizers worked, then no problem, and the residual sugars left in the mead would taste pretty good.&lt;/p>
&lt;p>Thankfully, nothing exploded in transit to the West Coast and the bottles arrived safely for their last adventure.&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/mead2/IMG_5047.JPEG#centerSmall" alt="Mead at Party">&lt;/p>
&lt;h2 id="my-cup-runneth-over">My Cup Runneth Over&lt;/h2>
&lt;p>From what I heard, the mead was well received&amp;hellip; very well received. The bachelor enjoyed the mead and it looks like they made some new friends too.&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/mead2/IMG_9932.JPEG#center" alt="Noah with a Bottle">&lt;/p>
&lt;p>Truly a bachelor party fit for royalty.&lt;/p>
&lt;p>Did I mention they had hatchets? Nobody was harmed during this process, which is pretty impressive when you consider bachelor parties&amp;rsquo; normal track records&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/mead2/IMG_9927.JPEG#center" alt="Sam with some Hatchets">&lt;/p>
&lt;p>I think I can consider myself pretty fortunate that my friends appreciate my unusual hobbies.&lt;/p></description></item><item><title>Alchemists Gold</title><link>https://varunhegde.com/alchemists-gold/</link><pubDate>Tue, 27 May 2025 22:16:15 -0400</pubDate><author>varun@varunhegde.com (Varun Hegde)</author><guid>https://varunhegde.com/alchemists-gold/</guid><description>&lt;h2 id="the-birth-of-alchemists-gold">The Birth of Alchemist&amp;rsquo;s Gold&lt;/h2>
&lt;p>&lt;a href="https://varunhegde.com/homebrew-no-5">Homebrew No. 5&lt;/a>, the latest large batch of mead, finally has a name! My buddy is going to a medieval themed bachelor party. What goes better with the medieval theme than your own brew?! Of course, since I&amp;rsquo;m not actually a huge fan of the last batch in its fully dry state, I&amp;rsquo;ve decided to backsweeten the batch, in the process creating&amp;hellip; ALCHEMIST&amp;rsquo;S GOLD. Ok, Sam came up with that name, but I do think it&amp;rsquo;s fitting for a mead that is magically golden and a sweet transmutation of honey into something you can drink any time.&lt;/p>
&lt;p>Here&amp;rsquo;s the label&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/mead/AlchemistsGold.jpg#center" alt="Alchemists Gold">&lt;/p>
&lt;p>You may notice the unusual 12% to 15% ABV at the bottom. It&amp;rsquo;s not a typo, I actually am not entirely sure what the actual ABV is. If the pre-stabilization mead fully digested all residual sugar, I&amp;rsquo;d expect the base brew to be 12% ABV. Unfortunately, stabilization is a challenging beast.&lt;/p>
&lt;h2 id="stabilization-procedure">Stabilization Procedure&lt;/h2>
&lt;p>&lt;img src="https://varunhegde.com/images/mead/backsweetener.jpg#center" alt="More Sugar">&lt;/p>
&lt;p>I loosely followed stabilization processes that I found online.&lt;/p>
&lt;ol>
&lt;li>Start with a completed secondary ferment&lt;/li>
&lt;li>Stir in 1/2 tsp/gal of potassium metabisulfite - is this too much? Please comment if you have expertise on this&lt;/li>
&lt;li>Wait 24h&lt;/li>
&lt;li>Stir in 1/2 tsp/gal of potassium sorbate&lt;/li>
&lt;li>Wait another 24h&lt;/li>
&lt;li>Stir in 8 oz of Crystal&amp;rsquo;s Wildflower Honey to backsweeten. The picture above is what the backsweetening must looks like. Maybe I should have pasteurized this?&lt;/li>
&lt;/ol>
&lt;p>You&amp;rsquo;d think this should work and we should have a sweeter 12% wine. You&amp;rsquo;d think wrongly though.&lt;/p>
&lt;h2 id="my-mead-could-use-a-therapist----its-alive-and-unstable">My Mead Could Use a Therapist &amp;ndash; It&amp;rsquo;s Alive and Unstable&lt;/h2>
&lt;p>The wine in the bucket seems to be fermenting again.&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/mead/IMG_4523_out.jpg#center" alt="Fermentation Nation">&lt;/p>
&lt;p>For the uninitiated, the picture on the left is a picture of the airlock on the bucket of mead, directly after stabilization and backsweetening. You can see the water levels in the two bulbs. The left side is slightly below the &amp;ldquo;max&amp;rdquo; level and the right side is slightly above. Six hours later, the water levels are reversed. That&amp;rsquo;s a sign that the backsweetened mixture is offgassing something, but there shouldn&amp;rsquo;t be any new gases generated. Unless of course, the yeast in the mead has started fermenting all over again. Great.&lt;/p>
&lt;p>The water level is obviously changing as a result of increased air pressure in the headspace in the bucket. Or is it? For you Physics PhDs, there are actually alternatives that could cause the increased air pressure. None of them have me particularly convinced but I might as well include them.&lt;/p>
&lt;ol>
&lt;li>Fermentation creates CO2. Some of that CO2 dissolves, especially under pressure. It&amp;rsquo;s been over 15 years since I&amp;rsquo;ve done any of that math, so &lt;del>I&amp;rsquo;ll leave it to you as an exercise&lt;/del> I&amp;rsquo;m going to just spitball here. The previously fully fermented mead has no residual sugar left so the stabilized mead has no nucleation points for CO2 to fall out of solution. When I add the honey and mix, this disturbs the solution and causes some to bubble out.&lt;/li>
&lt;li>When I open the container, the air pressure equalizes to the room. During the day, the bucket heats up just enough to force vapor pressure higher in the bucket. Again, been over a decade since I did this math, but hopefully somebody else can chime in.&lt;/li>
&lt;/ol>
&lt;h2 id="oh-the-places-youll-go">Oh the Places You&amp;rsquo;ll Go&lt;/h2>
&lt;p>This batch tastes much better once backsweetened. Because of the additional residual sugar and the possibility of continued fermentation, I&amp;rsquo;m actually not confident it will stop at 15% ABV. &lt;a href="https://www.lallemandbrewing.com/en/united-states/products/lalvin-ec-1118/">EC-1118&lt;/a>, the yeast that I used for fermentation is known for &amp;ldquo;robust and reliable fermentation&amp;rdquo; especially when fermenting the base for sparkling wines and for in bottle fermentation to make a wine fizzy. In other words - this yeast will knock your socks off. In fact, the yeast is alcohol tolerant up to 18% ABV.&lt;/p>
&lt;p>Since I know I&amp;rsquo;ll need to bottle before mid-June and it&amp;rsquo;s clear that the stabilizers aren&amp;rsquo;t doing their job correctly, I&amp;rsquo;ll probably need to pasteurize this batch. If anybody knows if it&amp;rsquo;s a bad idea to use my sous vide machine on mead, please comment.&lt;/p></description></item><item><title>Homebrew No 5</title><link>https://varunhegde.com/homebrew-no-5/</link><pubDate>Mon, 19 May 2025 18:01:06 -0400</pubDate><author>varun@varunhegde.com (Varun Hegde)</author><guid>https://varunhegde.com/homebrew-no-5/</guid><description>&lt;p>Just cracked open batch number 5 of my homebrewed mead. I initially made two batches of mead using &lt;a href="https://crystalsrawhoney.com/products/20lb-raw-honey-pail-copy">Crystal&amp;rsquo;s Wildflower Honey&lt;/a>. The must of the first batch was made with 10 lbs of honey, 1 quart of maple syrup, and enough water to fill the remaining volume of a 5 gallon HDPE food-safe bucket. I used 2 packets of Lalvin EC-1118 to get the fermentation process going. The must very obviously could have just used 1 packet though.&lt;/p>
&lt;p>The first batch ran through primary ferment slowly compared to the last few batches that I made. I suspect that&amp;rsquo;s because this is the largest batch I&amp;rsquo;ve made to date. All the early batches were 1 gallon total volume. I also started this batch during late summer&amp;hellip; not sure if the A/C made the fermentation run at a lower temperature or not. This batch ran primary fermentation from July &amp;lsquo;24 to December &amp;lsquo;24. After December &amp;lsquo;24, I re-racked the bucket into another 5 gallon bucket for a secondary fermentation. I&amp;rsquo;ve added no stabilizers so far.&lt;/p>
&lt;p>As a result of the extended primary fermentation and now 6-month secondary fermentation, there should be effectively zero residual sugar left in the batch. I didn&amp;rsquo;t do specific gravity testing before or after the fermentation completed. Assuming that the wine is fermented to zero residual sugar, back of the envelope calculations suggest that this batch yields 12% to 15% ABV.&lt;/p>
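&lt;p>For anybody who wants to redo the back of the envelope math, here is a rough Python sketch. Every constant in it is an assumption on my part rather than a measurement: roughly 35 gravity points per pound per gallon for honey, roughly 30 for maple syrup, about 2.75 lbs for a quart of syrup, and the common homebrew approximation ABV = (OG - FG) x 131.25.&lt;/p>

```python
def estimate_abv(honey_lb: float, syrup_lb: float, volume_gal: float,
                 final_gravity: float = 1.000,
                 honey_ppg: float = 35.0,   # assumed points/lb/gal for honey
                 syrup_ppg: float = 30.0):  # assumed points/lb/gal for syrup
    """Estimate original gravity and ABV from the sugar sources.
    Returns (original_gravity, abv_percent)."""
    points = (honey_lb * honey_ppg + syrup_lb * syrup_ppg) / volume_gal
    original_gravity = 1 + points / 1000
    # Common homebrew approximation, not a lab-grade formula.
    abv_percent = (original_gravity - final_gravity) * 131.25
    return original_gravity, abv_percent


# 10 lbs of honey, 1 quart of syrup (assumed to weigh 2.75 lbs), 5 gallons.
og, abv_dry = estimate_abv(10, 2.75, 5, final_gravity=1.000)
_, abv_very_dry = estimate_abv(10, 2.75, 5, final_gravity=0.990)
print(f"OG {og:.4f}, ABV between {abv_dry:.1f}% and {abv_very_dry:.1f}%")
```

&lt;p>Under these assumptions the estimate works out to roughly 11% to 13%, which sits at the low end of the range above. Without actual specific gravity readings this is only a ballpark.&lt;/p>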
&lt;h2 id="i-do-not-envy-you-the-headache-you-will-have-when-you-awaken">I do not envy you the headache you will have when you awaken&lt;/h2>
&lt;p>The mead aroma is honey forward with a vibrant plank-like smell. At room temperature it&amp;rsquo;s almost a &amp;ldquo;thin&amp;rdquo; smell. I don&amp;rsquo;t love the smell to be honest. The raw honey is obviously really sweet to the nose. It has a rich golden color and almost a sandy texture on account of the water evaporation over time.&lt;/p>
&lt;p>On tasting: very dry. Zero residual sugar or sweetness. The mead tasted grassy and floral. I can&amp;rsquo;t tell if this is the wildflowerness of the honey. It&amp;rsquo;s very aggressive on the nose from the high alcohol content too. You can easily taste how strong this batch is.&lt;/p>
&lt;p>The raw honey itself is sweet but not cloyingly so. It has a depth to it. There&amp;rsquo;s a sandy texture that I mentioned before, which seems to be caused by sugar crystallizing out of suspension. Honestly as far as honey goes this is pretty neutral and doesn&amp;rsquo;t really have a specific taste outside the green and floral aromas. Those influence the taste profile, but not too much&lt;/p>
&lt;p>I tried backsweetening the room temperature mead with 1 teaspoon of honey to two teaspoons of mead. It was obviously sweeter. Was it better? Sugar is basically crack, so yes. However, after backsweetening the taste mostly became honey forward with no subtlety. Sure, the sweetness of the honey added depth but it became almost boring.&lt;/p>
&lt;p>When chilled, the mead is still very floral scented, but the grassy taste completely disappears. It tastes better to me, but also very similar to a basic Pinot Grigio. If you&amp;rsquo;re allergic to grapes or gluten then this would be a very refreshing drink on a hot Summer day.&lt;/p>
&lt;h2 id="the-experimental-batch">The Experimental Batch&lt;/h2>
&lt;p>What it &lt;strong>really needs&lt;/strong> is secondary flavor and aromatic depth. Something that piques your interest and takes this from a good drink for beer pong to the kind of drink you could bring to a summer barbecue (barbeque? idk, not my wheelhouse).&lt;/p>
&lt;p>As it turns out, Jessie actually had the bright idea to do this before we even started. Our last experimental batch of mead (1 gallon, roughly 5 bottles) used a significant portion of maple syrup. We ran out of that so the big batch didn&amp;rsquo;t have enough of that, and the experimental batch needed another treatment. Jessie picked an amazing combo of lemon zest and Earl Grey tea, and I tossed in some juniper berries for a little kick.&lt;/p>
&lt;p>This experimental batch (name TBD) was easily our best batch ever. It was so good that we actually don&amp;rsquo;t know how it tastes after secondary ferment and aging because we drank the whole gallon batch on New Year&amp;rsquo;s Eve with our buddies. We&amp;rsquo;re definitely going to make another one of those some time.&lt;/p></description></item><item><title>Cabin Fever III - Basic Plans</title><link>https://varunhegde.com/cabin-fever-iii-basic-plans/</link><pubDate>Tue, 25 Mar 2025 22:00:28 -0400</pubDate><author>varun@varunhegde.com (Varun Hegde)</author><guid>https://varunhegde.com/cabin-fever-iii-basic-plans/</guid><description>&lt;p>I&amp;rsquo;ve spent the better part of a month doing some basic design work on the cabin I mentioned in &lt;a href="https://varunhegde.com/cabin-fever">Cabin Fever&lt;/a> and &lt;a href="https://varunhegde.com/cabin-fever-ii-my-cabin">Part II&lt;/a>&lt;/p>
&lt;p>If you get seasick or motion sick, I&amp;rsquo;m sorry&amp;hellip; take your daily dose of dramamine right now. If you have tips on how to make better animations in SketchUp 2017, let me know!&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>The cabin will be a basic stick frame structure supported 4&amp;rsquo; above the ground on concrete piers and posts. It&amp;rsquo;ll measure 19.5&amp;rsquo; wide by 30&amp;rsquo; deep. The rim joists and floor joists will be 2x10s. Exterior walls will be framed with 2x6s while interior walls will (tentatively) be framed with 2x4s.&lt;/p>
&lt;p>The ceiling joists, rafters and ridge will be 2x6s as well.&lt;/p>
&lt;p>Both roof and side sheathing will be OSB or Zip-R, haven&amp;rsquo;t decided yet, but I suppose it really boils down to preference.&lt;/p>
&lt;p>Deck support posts will be 6x6 and the decking itself will either have to be cedar or pine. I&amp;rsquo;m cheap, probably pine.&lt;/p>
&lt;p>I&amp;rsquo;ll go into more details in future posts. With the high level out of the way - here&amp;rsquo;s the animated design:&lt;/p>
&lt;p>&lt;img src="https://varunhegde.com/images/cabin/cabin-rotating.gif" alt="cabin-rotating">&lt;/p>
&lt;p>And some other still views of the building&lt;/p>
&lt;table>
&lt;tr>
&lt;th colspan=3>Somewhat Isometric&lt;/th>
&lt;/tr>
&lt;tr>
&lt;td colspan=3>
&lt;img src="https://varunhegde.com/images/cabin/cabin_iso2.png">
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Front&lt;/th>
&lt;th>Side&lt;/th>
&lt;th>Bottom&lt;/th>
&lt;/tr>
&lt;tr>
&lt;td>
&lt;img src="https://varunhegde.com/images/cabin/front.png">
&lt;/td>
&lt;td>
&lt;img src="https://varunhegde.com/images/cabin/side.png">
&lt;/td>
&lt;td>
&lt;img src="https://varunhegde.com/images/cabin/bottom.png">
&lt;/td>
&lt;/tr>
&lt;/table></description></item><item><title>Cabin Fever II - My Cabin</title><link>https://varunhegde.com/cabin-fever-ii-my-cabin/</link><pubDate>Wed, 19 Feb 2025 21:20:55 -0500</pubDate><author>varun@varunhegde.com (Varun Hegde)</author><guid>https://varunhegde.com/cabin-fever-ii-my-cabin/</guid><description>&lt;p>Last month I mentioned in &lt;a href="https://varunhegde.com/cabin-fever">Cabin Fever&lt;/a> that it&amp;rsquo;s not worth the effort of letting envy take hold of you. Ironically, I&amp;rsquo;ve since done exactly that in the most literal of ways.&lt;/p>
&lt;p>I watched dozens of hours of men building remote cabins in remote corners of Wisconsin in one of my latest YouTube binges. This guy &lt;a href="https://www.youtube.com/@Bushradical">@Bushradical&lt;/a> has got me particularly jealous. To be fair, I know nothing of the rest of his life, but the pieces he publicizes look positively serene.&lt;/p>
&lt;p>Not surprisingly, I now want a remote cabin of my own to spend time reflecting at. As any other man on a budget who grew up building with Lego sets, I want to build it myself. Is this nuts? Maybe, but a little planning and a can do attitude go a long way. Worst case scenario - I ordered duct tape from Costco that one time. I&amp;rsquo;ve heard that the tensile strength of duct tape is pretty much on par with concrete so no problem (MechE&amp;rsquo;s out there - I&amp;rsquo;m kidding!).&lt;/p>
&lt;p>From here on out I&amp;rsquo;m going to post progress updates on my designs and plans. There are a couple separate stages, but for now I think it makes sense to start with a very basic cabin design alongside some basic criteria for the land on which to build. With these two sets of plans in mind, I&amp;rsquo;ll actually be able to search and implement.&lt;/p>
&lt;h2 id="location">Location&lt;/h2>
&lt;p>I&amp;rsquo;d like it to be close enough to go for a weekend trip, but far enough to be inconvenient for a casual non-emergency day trip. Maybe 2-3 hours driving from Manhattan, NYC. That leaves the Hudson Valley, the Catskills, and maybe some areas of northeastern Pennsylvania.&lt;/p>
&lt;h3 id="base-characteristics-of-the-plot">Base Characteristics of the Plot&lt;/h3>
&lt;ul>
&lt;li>Size - I grew up on a quarter acre in the suburbs. I currently live in a 750 square foot apartment. I&amp;rsquo;m leaning towards between 5 and 50 acres. The wildly large range should allow flexibility for availability. Sometimes there just aren&amp;rsquo;t plots available for sale at a reasonable price for the value.&lt;/li>
&lt;li>Slope - There needs to be at least 1 flattish section, and ideally enough flat areas to provide a long driveway to a secluded clearing for the cabin.&lt;/li>
&lt;/ul>
&lt;h3 id="utilities">Utilities&lt;/h3>
&lt;p>Needs to have access to electric or solar. Gas would be great. Must be able to provide clean running water and plumbing, whether well+septic or public services.&lt;/p>
&lt;h3 id="the-vibes-characteristics-of-the-plot">The Vibes (Characteristics of the Plot)&lt;/h3>
&lt;p>I love the fall colors and the privacy that tree cover provides. Therefore the space should be wooded with ample full grown trees. A mix of maple and some native evergreen types would be conducive to keeping the plot sustainable, but I&amp;rsquo;ll need to consult with local forestry organizations.&lt;/p>
&lt;p>I could use a small stream to power some primitive tooling while figuring out longer term power. The stream could also just be purely cosmetic, I&amp;rsquo;m not too picky.&lt;/p>
&lt;h2 id="budget">Budget&lt;/h2>
&lt;ul>
&lt;li>Land - I haven&amp;rsquo;t thought this far ahead. For now, it&amp;rsquo;s just a pipe dream. Some rough math suggests that the land alone is going to be between $10k and $50k per acre. For the sake of argument let&amp;rsquo;s pick $20k and say that nets to $200k for 10 acres.&lt;/li>
&lt;li>Materials - expecting roughly $100 per square foot, but I&amp;rsquo;m going to complete my design and actually cost out every piece with overages to get a proper budget.
&lt;ul>
&lt;li>For a very rough 200 sq. ft. cabin (10&amp;rsquo;x20&amp;rsquo;), I should expect at least $20k in materials cost.&lt;/li>
&lt;li>For a 1200 sq. ft. cabin (30&amp;rsquo;x40&amp;rsquo;), therefore, I should expect to spend $120k in materials alone.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="endnote">Endnote&lt;/h2>
&lt;p>I&amp;rsquo;ll probably need a pickup truck at some point. Never thought that would be the first car I&amp;rsquo;d buy, but it would be pretty funny so I&amp;rsquo;m writing it down for posterity&amp;rsquo;s sake.&lt;/p></description></item><item><title>Cabin Fever</title><link>https://varunhegde.com/cabin-fever/</link><pubDate>Sun, 05 Jan 2025 15:35:01 -0500</pubDate><author>varun@varunhegde.com (Varun Hegde)</author><guid>https://varunhegde.com/cabin-fever/</guid><description>&lt;p>I read a piece by Shalom Auslander on &lt;a href="https://open.substack.com/pub/shalomauslander/p/the-dangers-of-substack-for-the-chronically">the Dangers of Substack for the Chronically Low Self-Esteemed&lt;/a> recently. The piece is from Aug &amp;lsquo;23 but it&amp;rsquo;s timeless nonetheless.&lt;/p>
&lt;p>Auslander explores his envy via George Saunders&amp;rsquo; literal cabin (and retreats) as a symbol for Saunders&amp;rsquo; relative success. Somewhere in his particularly vulgar and entertaining writing style, you can tease out partial attribution of the envy to just being a writer. I think it&amp;rsquo;s deeper than that.&lt;/p>
&lt;p>Comparison to the most extreme right tail is just part of being a person. It&amp;rsquo;s really easy to compare oneself to the most successful ideal outcome - we&amp;rsquo;re pretty well attuned to what that could be.&lt;/p>
&lt;p>Maybe Shalom can be better at suppressing the envy, maybe he can&amp;rsquo;t.&lt;/p>
&lt;p>I say it&amp;rsquo;s not even worth the effort.&lt;/p>
&lt;p>Better to just take the emotion and speedrun through it. I feel peace if I can quickly get through the following questions and get back to life:&lt;/p>
&lt;ul>
&lt;li>are they more successful?&lt;/li>
&lt;li>are they satisfied?&lt;/li>
&lt;li>am I good at what I do?&lt;/li>
&lt;li>am I satisfied?&lt;/li>
&lt;li>will I ever be satisfied?&lt;/li>
&lt;li>what would get me there?&lt;/li>
&lt;/ul>
&lt;p>I bet there&amp;rsquo;s a strong correlation of envy to cabin fever. When you&amp;rsquo;re stuck inside or in isolation more often, you have way more time to compare yourself to others.&lt;/p>
&lt;p>I don&amp;rsquo;t want to sound like a self-help book, but my best bet is that if Shalom leaves Panera and gets more exposure to the world outside of the cabin, he&amp;rsquo;ll worry less about whether his follow count is high enough.&lt;/p></description></item><item><title>Backing Away From Brainrot</title><link>https://varunhegde.com/backing-away-from-brainrot/</link><pubDate>Wed, 01 Jan 2025 12:53:13 -0500</pubDate><author>varun@varunhegde.com (Varun Hegde)</author><guid>https://varunhegde.com/backing-away-from-brainrot/</guid><description>&lt;p>Lately, I&amp;rsquo;ve felt a cognitive dullness&amp;hellip; a brainrot. I don&amp;rsquo;t have an exact definition for &lt;em>brainrot&lt;/em> and I&amp;rsquo;m not going to check a dictionary. No need, if you&amp;rsquo;ve been living in the modern world you know exactly what I mean.&lt;/p>
&lt;p>That feeling might just be a part of aging, but I suspect that it&amp;rsquo;s really driven by consuming short videos from whatever the service-du-jour is (TikTok, Snapchat, Facebook, Instagram, Twitter, YouTube, etc.). For what it&amp;rsquo;s worth, a lot of experts agree with me too.&lt;/p>
&lt;p>So, I&amp;rsquo;m going to start producing, instead of consuming. I couldn&amp;rsquo;t care less if anybody actually reads my work, so let&amp;rsquo;s call it public (non-)consumption.&lt;/p>
&lt;h2 id="why-public-non-consumption">why public (non-)consumption?&lt;/h2>
&lt;p>You might wonder why anybody would publish publicly with no expectation of feedback.&lt;/p>
&lt;p>I&amp;rsquo;ve been a content &lt;em>consumer&lt;/em> for as long as I can remember. It just makes sense. When you&amp;rsquo;re growing up, and your brain is developing, your theories may be original to you. Those ideas will lack context though. Most people would find them naive or lacking nuance. And so you are encouraged to learn by reading others&amp;rsquo; work, looking at photos, watching videos, and so on.&lt;/p>
&lt;p>Now I&amp;rsquo;m slightly older, and I hope that the very act of curating and focusing my thoughts for public (non-)consumption will help develop mental acuity and reverse the brainrot.&lt;/p></description></item></channel></rss>