Yet Another Personal Space
Stuff to read in case you have a minute or two

January, 2008

Mailing List Data Mining - The Math

Enough games. Let’s get serious for a second and do some math. We’ll start with an illustration that represents the state of a mail list with 500+ contributors over a one-year period (20,000+ emails plotted). Technically it’s a function of two parameters, F(T,C). The horizontal axis is time (T), while the vertical axis represents contributors (C). In this particular case F(T,C) is MailSent(T,C), where:

MailSent(T,C) = 1, when the contributor C sent an email at time T; 0 otherwise.

You can think of it as a forest of equally tall trees (height = 1). Ideally we would plot it as a 3D graph, which would have a bunch of dots on one surface (Z = 1). The picture we have here is the view from the top. We could introduce a third parameter E (email), but for this research we don’t really care which particular email was sent. We are only interested in the fact of sending.

In a similar manner we’ll introduce ThreadStarted(T,C), ThreadJoined(T,C) and other functions (the complete list of functions will follow in one of the next chapters). Again, we are not interested in which particular thread was started; we only need to know who started it and when. With a tiny tweak we introduce:

TrafficGenerated(T,C) = N, when the contributor C started a thread at T and the thread grew to N emails; 0 otherwise (started no threads).

In this case, it’s a forest of trees of different heights. Each tree represents one thread and its height depends on the thread length. The tallest tree shows us the largest thread. Using the same logic we will add:

Audience(T,C) = M, when the contributor C started a thread at T and M other contributors joined the thread; 0 otherwise (started no threads).
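The three functions above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used for the research: the Email record, its field names and the tiny sample archive are all made up here. Zero values are simply left out, so a missing (T, C) key means 0, and the sketch assumes a contributor never starts two threads at exactly the same T.

```python
from collections import defaultdict
from typing import NamedTuple


class Email(NamedTuple):
    """One archived email; the field names are assumptions for this sketch."""
    time: int          # T, e.g. the day number within the year
    contributor: str   # C
    thread: str        # thread id (emails with the same subject)


def mail_sent(emails):
    """MailSent(T,C): the set of (T, C) points where the value is 1."""
    return {(e.time, e.contributor) for e in emails}


def group_threads(emails):
    """Group emails by thread, oldest first, so the first one is the starter."""
    threads = defaultdict(list)
    for e in sorted(emails, key=lambda e: e.time):
        threads[e.thread].append(e)
    return threads


def traffic_generated(emails):
    """TrafficGenerated(T,C) = N: thread length, keyed by the starter's (T, C)."""
    return {(msgs[0].time, msgs[0].contributor): len(msgs)
            for msgs in group_threads(emails).values()}


def audience(emails):
    """Audience(T,C) = M: how many other contributors joined the thread."""
    result = {}
    for msgs in group_threads(emails).values():
        starter = msgs[0]
        others = {m.contributor for m in msgs} - {starter.contributor}
        result[(starter.time, starter.contributor)] = len(others)
    return result


# Hypothetical sample archive: two threads, three contributors.
archive = [
    Email(1, "alice", "build broken"),
    Email(2, "bob", "build broken"),
    Email(3, "carol", "build broken"),
    Email(3, "bob", "release plan"),
    Email(5, "alice", "build broken"),
]

print(mail_sent(archive))
print(traffic_generated(archive))
print(audience(archive))
```

With this sample archive, alice’s thread shows up as a tree of height 4 with an audience of 2, while bob’s unanswered thread has height 1 and an audience of 0.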
For any F(T,C) and a fixed time interval [T1, T2) we can introduce a set of cumulative functions:

1) Contributor’s Total: the sum of F(T,C) over all T in [T1, T2), for a selected contributor C.
Example: when F is MailSent and [T1, T2) = 'year 2007', it gives us the total number of emails sent by the selected contributor in 2007.

2) Mail List Total: the sum of the Contributor’s Totals over all contributors C1, ..., CN, where N is the total number of mail list contributors.
Example: the total number of emails sent to the mail list in 2007.

3) Contributor’s Share: the Contributor’s Total divided by the Mail List Total.
Example: 5% of all emails in 2007 were sent by the contributor C.

4) Contributor’s Rate: For any Ti within our fixed time interval [T1, T2) all contributors can be divided into two groups: those who joined the mail list before Ti (“veterans”) and those who started after Ti (“rookies”). Let’s define Ti as the time when the contributor i sent their first email within our interval (T1 ≤ Ti < T2). Then the rate of the contributor i is their Contributor’s Total divided by the time delta (T2 - Ti).
Example: in 2007 the contributor C was sending 3.5 emails per day on average. If the time delta is measured in days, it’s a daily rate function; if we measure it in hours, it’s an hourly rate function, etc.

December, 2007

Mailing List Data Mining - The Game

If you think of it as a game, we would have the following rules:

1) You get 1 point for sending an email to the mail list
2) If you start a thread and it grows up to N emails long, you get N points

We’ll distinguish two types of contributors: the ones who start the threads and the ones who send replies. To use another analogy, we’ll call them architects and builders. Architects come up with the design specs and master plans (starting new threads) and the builders implement them (filling the thread up with emails). Some master plans might never be implemented (no replies to your threads). So we’ll treat all architects as builders and we’ll count the thread-starting email as a reply. Of course, the builders can be architects as well.

What’s the goal of the game? To make some noise, of course! Let the subscribers see how smart you are, what a great speaker you are and how many brilliant ideas you have. Whatever it takes, make them remember your name! Let’s just say the goal of this game is to get more points.

Some hints for the winners:

1) Be a good architect. You can start a few very provocative and controversial threads, getting a lot of contributors to participate and, eventually, earn points for you. Or you can start a lot of not-so-provocative threads: if anyone replies, you’ll get at least one extra point per thread.
2) Be a hard-working builder. You send tons of emails, joining as many threads as you can. There’s a new thread? Jump on it! Keep it alive as long as you can.
3) Use 1 and 2 at the same time

One more thing: thread hijacking. Every contributor has probably experienced it once: you start a new thread, someone replies changing the subject, and there it goes, it’s not your thread anymore. But hey, if you hadn’t started that thread, it wouldn’t have been hijacked. No worries, here you get the points. It’s a fair game. :]

Ok, please scratch that irony and sarcasm out, even though that’s how I got myself into this research: by getting curious why I remember some names and don’t remember others. But there are thousands of really useful mail lists. So many projects wouldn’t succeed or even survive without one. We could still use this model here to cheer the most valuable contributors.

December, 2007

Mailing List Data Mining - Terminology

We are going to use the following terminology (most of the terms are self-explanatory anyway):
1) Mail list - a collection of names and addresses used by an individual to send material to multiple recipients.
2) Mail list archive - a collection of emails sent to the mail list over a time period in the past.
3) Email - in the context of this research, an email sent to the mail list.
4) Subscriber - a mail list reader.
5) Contributor - a subscriber who sent at least one email to the mail list.
6) Thread (or topic, or conversation) - a collection of emails with the same subject. Each thread is started by the contributor who sent the very first (oldest) email in the thread.
7) Traffic - all emails in all threads started by a contributor.
8) Audience - all contributors who participated in the threads started by a contributor.

So what are we interested in? Where’s the cool stuff? Let’s start with some observations. What do contributors do?

1) Contributors send emails
2) Contributors start threads, which in turn implies more traffic
3) Contributors join threads started by other contributors
4) Contributors use a certain vocabulary

December, 2007

Mailing List Data Mining - Introduction

Data mining is hot these days. Social networks are even hotter. I really don't know how to measure the hotness of social network data mining. Maybe infinitely hot will do it... If you want to get your hands dirty with social network data mining, it's easier than you think. You probably already have a good example of a social network right in your mailbox. I'm talking about mailing lists here. Last year I did some mining for one of the lists I'm subscribed to (just for fun!). I gotta tell you: there is a LOT of cool stuff you can dig out of it. And yes, it became an obsession for some time. I kept thinking about how to model it in a better way, so I could come up with more interesting facts. And here's what I got...

December, 2007

Applied Data Mining

Data mining is hot these days. And there are reasons why.

· It’s fun
· It tells you what’s inside that black box full of numbers
· It helps you feel that things are under control: if there are problems, you know which one to fix first

I like doing data mining. What I enjoy the most is what I call applied data mining. What I mean by that is taking a real-life data set you deal with on a daily basis (boring stuff) and digging out some unexpected facts (cool stuff). Examples of such data sets: your mailbox, your budget or even your movie collection. Who was the champion in sending emails to you last year? Who’s the most popular actor in your movie collection? How much money did you leave at your favorite sushi place over the past two years? (I wish I hadn’t checked this one…) What was the lowest temperature of your hard drive last year? You’ll be surprised how many interesting facts are hidden behind the numbers.

But be careful. After doing this kind of exercise a couple of times, it might become an obsession. You might start hunting for new data sets to mine. Every row of numbers will start looking to you like another opportunity to find some unexpected facts (even if there’s nothing cool behind it :) )