January, 2008

Mailing List Data Mining - The Math

Enough games. Let’s get serious for a second and do some math. We’ll start with an illustration that represents the state of a mail list with 500+ contributors over a one-year period (20,000+ emails plotted).

Technically, it’s a function of two parameters, F(T,C). The horizontal axis is time (T), while the vertical axis represents contributors (C). In this particular case F(T,C) is MailSent(T,C), where:

MailSent(T,C) = 1 when the contributor C sent an email at time T; 0 otherwise.

You can think of it as a forest of equally tall trees (height = 1). Ideally we would plot it as a 3D graph with a bunch of dots on one plane (Z = 1). The picture we have here is the view from the top. We could introduce a third parameter E (email), but for this research we don’t really care which particular email was sent. We are only interested in the fact of sending.

In a similar manner we’ll introduce ThreadStarted(T,C), ThreadJoined(T,C) and other functions (the complete list of functions will follow in one of the next chapters). Again, we are not interested in which particular thread was started; we only need to know who started it and when.

With a tiny tweak we introduce:

TrafficGenerated(T,C) = N when the contributor C started a thread at T and the thread grew to N emails; 0 otherwise (started no threads).

In this case, it’s a forest of trees of different heights. Each tree represents one thread, and its height depends on the thread length. The tallest tree shows us the largest thread.

Using the same logic we will add:

Audience(T,C) = M when the contributor C started a thread at T and M other contributors joined the thread; 0 otherwise (started no threads).
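To make these definitions concrete, here is a minimal Python sketch that derives MailSent, TrafficGenerated and Audience from a tiny made-up archive. The `Email` record, the function names and the sample data are assumptions for illustration only, not the code used in this research. Threads are grouped by identical subject, following the definition in the Terminology post; a real archive would need subject normalization (for example stripping "Re:"), which is left out here.

```python
from collections import defaultdict
from typing import NamedTuple


class Email(NamedTuple):
    time: int          # T, e.g. the day number within the year (assumed unit)
    contributor: str   # C
    subject: str       # identifies the thread (assumed already normalized)


def mail_sent(emails):
    """MailSent(T, C): 1 for every (T, C) at which C sent an email.

    Pairs that are absent from the dict correspond to the value 0.
    """
    return {(e.time, e.contributor): 1 for e in emails}


def thread_functions(emails):
    """Return (TrafficGenerated, Audience) as dicts keyed by (T, C).

    The key is the time and author of the first (oldest) email of a thread.
    If one contributor starts two threads at the same T, the later one
    overwrites the earlier one; a real tool would have to handle that.
    """
    threads = defaultdict(list)
    for e in sorted(emails, key=lambda e: e.time):
        threads[e.subject].append(e)

    traffic, audience = {}, {}
    for thread in threads.values():
        starter = thread[0]
        key = (starter.time, starter.contributor)
        # N counts every email in the thread, including the first one.
        traffic[key] = len(thread)
        # M counts the distinct contributors other than the starter.
        audience[key] = len({e.contributor for e in thread} - {starter.contributor})
    return traffic, audience


# Hypothetical sample archive: two threads, three contributors.
emails = [
    Email(1, "ann", "build broken"),
    Email(2, "bob", "build broken"),
    Email(2, "cat", "release plan"),
    Email(3, "ann", "build broken"),
    Email(4, "cat", "build broken"),
]

sent = mail_sent(emails)
traffic, audience = thread_functions(emails)
print(traffic)
print(audience)
```

Storing only the non-zero points as dictionary entries mirrors the "forest" picture: every key is one tree, and its value is the height of that tree.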
For any F(T,C) and a fixed time interval [T1, T2) we can introduce a set of cumulative functions:

1) Contributor's Total: $\mathrm{Total}_F(C) = \sum_{T_1 \le T < T_2} F(T, C)$
Example: when F is MailSent and [T1, T2) = 'year 2007', it gives us the total number of emails sent by the selected contributor in 2007.

2) Mail List Total: $\mathrm{ListTotal}_F = \sum_{C=1}^{N} \mathrm{Total}_F(C)$, where N is the total number of the mail list contributors.
Example: the total number of emails sent to the mail list in 2007.

3) Contributor’s Share: $\mathrm{Share}_F(C) = \mathrm{Total}_F(C) \, / \, \mathrm{ListTotal}_F$
Example: 5% of all emails in 2007 were sent by the contributor C.

4) Contributor’s Rate
For any Ti within our fixed time interval [T1, T2) all contributors can be divided into two groups: those who joined the mail list before Ti (“veterans”) and those who started after Ti (“rookies”). Let’s define Ti as the time when the contributor i sent the first email within our interval (T1 ≤ Ti < T2). Then, for a contributor C: $\mathrm{Rate}_F(C) = \frac{1}{T_2 - T_C} \sum_{T_1 \le T < T_2} F(T, C)$
Example: in 2007 the contributor C was sending 3.5 emails per day on average. If the time delta is measured in days, it’s a daily rate function; if we measure it in hours, it’s an hourly rate function, etc.

December, 2007

Mailing List Data Mining - The Game

If you think of it as a game, we would have the following rules:

1) You get 1 point for sending an email to the mail list
2) If you start a thread and it grows up to N emails long, you get N points

We’ll distinguish two types of contributors: the ones who start the threads and the ones who send replies. To use another analogy, we’ll call them architects and builders. Architects come up with the design specs, the master plans (starting new threads), and the builders implement them (filling the thread up with emails). Some master plans might never be implemented (no replies to your threads). So we’ll treat all architects as builders, and we’ll count the thread-starting email as a reply. Of course, the builders can be architects as well.

What’s the goal of the game? To make some noise, of course! Let the subscribers see how smart you are, what a great speaker you are and how many brilliant ideas you have. Whatever it takes, make them remember your name! Let’s just say the goal of this game is to get more points.

Some hints for the winners:

1) Be a good architect. You can start a few very provocative and controversial threads, getting a lot of contributors to participate and, eventually, earn points for you. Or you can start a lot of not-so-provocative threads: if anyone replies, you’ll get at least one extra point per thread.
2) Be a hard-working builder. You send tons of emails, joining as many threads as you can. There’s a new thread? Jump on it! Keep it alive as long as you can.
3) Use 1 and 2 at the same time

One more thing: thread hijacking. Every contributor has probably experienced it at least once: you start a new thread, someone replies changing the subject, and there it goes: it’s not your thread anymore. But hey, if you hadn’t started that thread, it wouldn’t have been hijacked. No worries, here you get the points. It’s a fair game. :]

Ok, please scratch that irony and sarcasm out, even though that’s how I got myself into this research: by getting curious why I remember some names and don’t remember others. There are thousands of really useful mail lists. So many projects wouldn’t succeed or even survive without one. We could still use this model to cheer the most valuable contributors.

December, 2007

Mailing List Data Mining - Terminology

We are going to use the following terminology (most of the terms are self-explanatory anyway):
1) Mail list - a collection of names and addresses used by an individual to send material to multiple recipients.
2) Mail list archive - a collection of emails sent to the mail list over a time period in the past.
3) Email - in the context of this research, an email sent to the mail list.
4) Subscriber - a mail list reader.
5) Contributor - a subscriber who sent at least one email to the mail list.
6) Thread (or topic, or conversation) - a collection of emails with the same subject. Each thread is started by the contributor who sent the very first (oldest) email in the thread.
7) Traffic - all emails in all threads started by a contributor.
8) Audience - all contributors who participated in the threads started by a contributor.

So what are we interested in? Where’s the cool stuff? Let’s start with some observations. What do contributors do?

1) Contributors send emails
2) Contributors start threads, which in turn implies more traffic
3) Contributors join threads started by other contributors
4) Contributors use a certain vocabulary

December, 2007

Mailing List Data Mining - Introduction

Data mining is hot these days. Social networks are even hotter. I really don't know how to measure the hotness of social network data mining. Maybe infinitely hot will do it...

If you want to get your hands dirty with social network data mining, it's easier than you think. You probably already have a good example of a social network right in your mailbox. I'm talking about mailing lists here.

Last year I did some mining for one of the lists I'm subscribed to (just for fun!). I gotta tell you: there is a LOT of cool stuff you can dig out of it. And yes, it became an obsession for some time. I kept thinking about how to model it in a better way, so I could come up with more interesting facts. And here's what I got...

December, 2007

Applied Data Mining

Data mining is hot these days. And there are reasons why.

· It’s fun
· It tells you what’s inside that black box full of numbers.
· It helps you feel that things are under control: if there are problems, you know which one to fix first

I like doing data mining. What I enjoy the most is what I call applied data mining. What I mean by that is taking a real-life data set you deal with on a daily basis (boring stuff) and digging out some unexpected facts (cool stuff). Examples of such data sets: your mailbox, your budget or even your movie collection.

Who was the champion in sending emails to you last year? Who’s the most popular actor in your movie collection? How much money did you leave in your favorite sushi place over the past two years? (I wish I hadn’t checked this one…) What was the lowest temperature of your hard drive last year? You’ll be surprised how many interesting facts are hidden behind the numbers.

But be careful. After doing this kind of exercise a couple of times, it might become an obsession. You might start hunting for new data sets to mine. Every row of numbers will start looking to you like another opportunity to find some unexpected facts (even if there’s nothing cool behind it :) )

May, 2006

Blog Content Transfer

I’ve been running my blog for the last 3 years. It started as my personal “things I shouldn’t forget to read” list: with tons of information coming from all kinds of sources, these days it’s easy to get lost and miss some stuff that really matters. I decided to keep a list of links and notes that would be cool to review when I have time. That’s how the blog was born.

Initially it was targeted at only one reader: me. But at some point I realized that it might be interesting for my friends as well. And that’s how it became public. Soon after that, the first question came up.

As we know, there are hundreds (ok, maybe dozens) of web sites that can help you become a blogger. Each one has many features, BUT there is one thing I check first: what are my options for notifying “the readers” about new stuff on the blog? Seriously, it is a big deal for me.
I tried many approaches.

1) You can just assume that people come to your site once in a while. The problem here is that I’m lazy. I can add 10 awesome (as all my posts are!) entries in one day when I’m in the mood and have time. Or I can totally abandon the site for months when I’m busy (read: lazy). The second is usually the case. And people need to get some “fresh content” all the time. If they come to your site and there’s nothing new for a week, they assume it’s all over and will never come back.
2) Send an email/message when there’s something new. That can easily become annoying. “Oh, new entry on your blog? Guess what: I don’t care. And stop sending me that spam!”
3) Maintain an RSS feed. Probably the best option out of these three. Only people who really care about your stuff will add your feed to their RSS reader. If they use one, of course.

I’m getting very close to the actual reason why I’m telling you all this. It’s GLEAMS. Yes, gleams. If you happen to use Windows Live Messenger or MSN Messenger, you have probably seen those little stars that shine next to some of your contacts from time to time. Usually it says something like: “Foo’s space was updated. Foo is happy to say…blah blah yada yada…[a couple of the first lines from the latest blog entry]”.
If you click on it, you'll see something like this:
And based on what I see, it is super catchy. Psychologically it works very well: those flashing stars really grab your attention. And that’s all “the publisher” needs. By the way, “the publisher” does not have to do anything to send those gleams. It’s all automated. Submitting a new entry will automatically generate a gleam for your contact in your friends’ messengers.

So I loved this feature, and that’s how the second question came up: how do I transfer my original awesome precious content to Space? I really like how it sounds…

Apparently, my old blogging site (Blogger) and the new one (MSN Spaces) expose some APIs that can be used to play with your content (MetaWeblog API for Space and Atom API for Blogger). I spent some time over the weekend and wrote a tool that helped me transfer my data between the two sites.

In case somebody else is as excited about gleams as I am ;] I've decided to share a copy of BCTransfer (Blog Content Transfer). Please let me know if it works, and especially if it does not work for you. I’ll be more than glad to help.

Please read this first. It tells you how to get a login for your space.

At this moment, it is a command-line tool written using .NET 2.0, so you need to have that installed (the easiest option is Windows Update). Here’s how you run it in the most basic scenario:

bctransfer -bu <old-username> -bp <old-password> -su <new-username> -sp <new-password>

By default, the tool will give you a chance to review the results and then delete all transferred posts. I really don't want to break anything if you are already using MSN Spaces. This behaviour can be changed by adding the -k (keep) option. For more advanced scenarios, just type bctransfer and play with the different options.

Oh yeah, this MSN Space can be used as an illustration. It is full of my original awesome precious content transferred by the tool. In case you care: it's a mix of Ukrainian, Russian and English.
If you understand all of them, there's another catch: so far I've only transferred posts from 2003/2004. No need to waste your time reading old news... ;]

More GLEAMS will follow.