Email Migration

8 ways to use email header info and how to extract it

http://www.techrepublic.com/blog/opensource/8-ways-to-use-email-header-info-and-how-to-extract-it/3371?tag=content;blog-list-river

Marco Fioretti offers a tip to help you sort and process random emails by extracting the email header info first. Here are eight tasks you can accomplish with this script.

Have you ever had a relative, or a boss at the office, ask you to “help them reorder” a few thousand email messages, scattered without rhyme or reason over dozens of folders? I have, and it’s not something you want to do entirely by hand if you can possibly avoid it, because it’s terribly time-consuming.

Luckily, if all the email to reorder is already in Maildir or any other format in which each message is in a separate file, the solution is easier than you may think.

The first thing to do is the one in the title of this post: extract all the headers from each message and write them down, together with the name of the file containing it, in a format that will make further processing easier. You want to generate one single list, in which each email is represented by one plain text record like this:

  ###############################################################
  FILENAME:  /email/.2011.11/cur/1323.M217761.polaris,S=263474,W=267108:2,S
  SUBJECT:   Re: Presentations of Open Data Meeting
  FROM:      "M. Fioretti" <marco@digifreedom.net>
  TO:        Chris <chris@example.com>
  CC:        Marco <marco@digifreedom.net>, tom@example.com
  BCC:
  DATE:      Tue, 29 Nov 2011 01:23:47 -0800
  TIMESTAMP: 2011-11-29 09:23:47
  MSGID:     <20111119093315.GC7496@nexaima.net>
  INREPLY:   <1ffde8c9bafae02c8a4f2b27724992f8@10.30.200.104>
  ###############################################################

Why? Well, because an index like that makes it quite easy (again: if each message is in a separate file) to write simple scripts that use that list to perform any kind of further processing; for example, you could:

1. Sort email into different folders according to any combination of criteria. You may, for example, write conditions such as “if $FROM or $TO includes the string ‘@mycompany.com’ and $TIMESTAMP begins with 2011-11, move $FILE to a folder called 2011.11.mycompany”.

2. Create different levels of access to email archives: “email between me, my superior officer and nobody else goes to a folder that nobody else can read, email to my subordinates goes to another folder that they can read”.

3. Remove extra copies of the same message, by deleting all the files (except the first one, of course) that have the same MSGID (Message-ID header).

4. Extract addresses and add them to address books or customer databases, depending on what those customers wanted. Example: “if $TO is support@mycompany.com, add $FROM to the list of people who asked for support”.

5. Generate custom mailboxes, to satisfy requests like “please send me a copy of all the email we exchanged with Mr X during the last quarter”.

6. Create all sorts of statistics (and graphs) about email activity. If you wanted to know in which month of 2005 you got the most email from your relatives, you’d need data like that in the listing above.

7. Feed everything to a relational database, in case you need to perform really complex queries, or correlate those headers with other data.

8. Analyze the route followed by each email, and how long it took (this is what the Received headers collected by the script below are for).
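
As an illustration of task 3, here is a minimal sketch (not part of the original script) that lists the files holding extra copies of a message. It assumes the record layout shown above and accepts either FILE: or FILENAME: as the label of the file line. The function name list_duplicates is made up for this example. It only prints file names, so nothing is deleted until you have checked the list.

```shell
#!/bin/bash
# Print the file of every record whose MSGID was already seen in an
# earlier record of the index (that is, every copy after the first one).
# Assumption: one "FILE:" or "FILENAME:" line precedes each "MSGID:" line.
list_duplicates() {
  awk '
    /^[ \t]*FILE(NAME)?:/ {
      sub(/^[ \t]*FILE(NAME)?:[ \t]*/, "")
      file = $0
    }
    /^[ \t]*MSGID:/ {
      sub(/^[ \t]*MSGID:[ \t]*/, "")
      # Records without a Message-ID are skipped rather than treated as copies
      if ($0 != "" && seen[$0]++) print file
    }
  ' "$1"
}
```

Something like `list_duplicates email_index.txt > duplicates.txt` would then give you a list to review before removing anything.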

Where’s the code?

When I found myself with almost 150K messages (no kidding!) to reorder for the reasons above, I quickly put together the “simplemailparser” that follows, which only needs the two Perl modules listed in lines 4 and 5. If the file passed as first argument ($ARGV[0]) has a name that identifies it as an IMAP index file (lines 9 to 12), the script just exits. Otherwise, the whole content of the file (which, remember, contains only one email) is loaded into the $raw_email variable. After that, all the real work is done by the Perl modules. The first one creates an email object from $raw_email (line 21) and then uses its methods to save all the headers in separate variables. In lines 32-34, the other module uses the Date extracted by the first one to give all messages a $timestamp in the same time zone (compare DATE and TIMESTAMP in the listing above to see what I mean). Finally, the script prints everything out:

       1       #! /usr/bin/perl
       2
       3       use strict;
       4       use Email::Simple;
       5       use DateTimeX::Easy;
       6
       7       my $raw_email;
       8
       9       exit if (($ARGV[0] =~ m/\/dovecot\./) ||
      10                ($ARGV[0] =~ m/\/dovecot-/)  ||
      11                ($ARGV[0] =~ m/\/maildirfolder$/)
      12       );
      13
      14       print "#"x120, "\nFILE:      $ARGV[0]\n";
      15
      16       open (MESSAGE, "<", $ARGV[0]) || die "Couldn't open email $ARGV[0]: $!\n";
      17       undef $/;                     # slurp mode: read the whole file at once
      18       $raw_email = <MESSAGE>;
      19       close MESSAGE;
      20
      21       my $mail            = Email::Simple->new($raw_email);
      22       my $from_header     = $mail->header("From");
      23       my $to_header       = $mail->header("To");
      24       my $date_header     = $mail->header("Date");
      25       my $cc_header       = $mail->header("CC");
      26       my $bcc_header      = $mail->header("BCC");
      27       my $msgid_header    = $mail->header("Message-ID");
      28       my $subject_header  = $mail->header("Subject");
      29       my $inreply_header  = $mail->header("In-Reply-To");
      30       my @received        = $mail->header("Received");
      31
      32       my $timestamp     = DateTimeX::Easy->date($mail->header("Date"));
      33       $timestamp->set_time_zone("GMT");
      34       $timestamp =~ s/T/ /;
      35
      36       print<<END;
      37       SUBJECT:   $subject_header
      38       FROM:      $from_header
      39       TO:        $to_header
      40       CC:        $cc_header
      41       BCC:       $bcc_header
      42       DATE:      $date_header
      43       TIMESTAMP: $timestamp
      44       MSGID:     $msgid_header
      45       INREPLY:   $inreply_header
      46       END
      47       exit;

To run the script on all the messages in your top level email folder, use the find command:

find MyTopLevelEmailFolder -type f -exec simplemail_parser {} \; > email_index.txt

Then find something else to do until it’s finished. This procedure is slow, because it starts Perl once per message. However, it takes less than five minutes to install the Perl modules, copy the script and launch it. Since, after launch, the script works by itself and you shouldn’t need to run it more than once per archive anyway, I think it’s a good compromise. Do you?
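
Once email_index.txt exists, simple statistics like those of task 6 take only a few lines. The following is a rough sketch, assuming TIMESTAMP values formatted as in the sample record ("2011-11-29 09:23:47"). The function name messages_per_month is invented for the example.

```shell
#!/bin/bash
# Count the messages of each month, using the first 7 characters
# (YYYY-MM) of every TIMESTAMP value found in the index.
# Records with an empty TIMESTAMP are ignored.
messages_per_month() {
  awk '
    /^[ \t]*TIMESTAMP:/ && NF >= 2 { count[substr($2, 1, 7)]++ }
    END { for (month in count) print month, count[month] }
  ' "$1" | sort
}
```

Running `messages_per_month email_index.txt` prints one "YYYY-MM count" line per month, ready to be fed to a plotting tool or a spreadsheet.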

Why and how to migrate your (old) mailboxes to IMAP and Maildir

http://www.techrepublic.com/blog/opensource/why-and-how-to-migrate-your-old-mailboxes-to-imap-and-maildir/3398?tag=nl.e550

Takeaway: Marco Fioretti provides the steps to automatically move your old mailboxes to IMAP and Maildir.

This week’s topic is partly a natural follow-up to my previous post, “8 ways to use email header info and how to extract it”, and partly a reaction to one of the comments it received:

With the advent of Gmail and the ongoing market dominance of Outlook in the corporate arena, this tip is all but useless

I strongly disagree, for the reasons you can read in that thread plus the fact that, even if you use Gmail or a similar service, it is still absolutely necessary to keep a private, complete backup of all your email, either on a server of your own or on a local hard drive. As they say, I may be paranoid, but that doesn’t mean somebody isn’t out there to hurt me, that is, in this case, to close my webmail account. Therefore, this week I’ll explain a FOSS way to automatically copy a bunch of mailboxes to other IMAP/Maildir ones. Before that, however, I need to give an approximate answer to the question…

What are IMAP and Maildir?

The Internet Message Access Protocol (IMAP) lets you work efficiently with email that stays stored on a remote server. If you access your remote inbox through an IMAP server, for example, you won’t need to download a whole uninteresting message or a big attachment just to delete it.

For the purpose of this post, mailbox formats can be divided into two categories. In the first one (mbox and its derivatives), all the messages of each mailbox are written in one single file, one after another. If Bob’s “home” email directory is /home/bob/Mail and he has three mailboxes called Work, Family and Friends, a listing of that directory will show three files with the same names:

  #> ls /home/bob/Mail
  Family
  Friends
  Work

The other class of formats, instead, uses directories as mailboxes, normally with subdirectories, and stores each email in a separate file. The most popular representative of this category, unsurprisingly called Maildir, uses three subfolders per mailbox/directory, called new, cur (current) and tmp (temporary). If Bob used Maildir, he’d have this folder structure inside /home/bob/Mail:

  .Family/new
  .Family/cur
  .Family/tmp
  .Friends/new
  .Friends/cur
  .Friends/tmp
  .Work/new
  .Work/cur
  .Work/tmp

where newly delivered messages wait in “new”, messages your mail client has already seen go to “cur”, and “tmp” is used while a message is still being written. Maildir (like the other directory-based formats) has a lot of advantages over mbox:

• it is supported by common IMAP servers (Dovecot and Courier, for example)

• but you don’t need an IMAP server to use Maildir on your computer: many email clients, Mutt included, can access it directly

• it is more robust than single-file formats. With Maildir, you can delete an email from a mailbox at the very moment the server adds another one to it, without any data corruption

• it makes it much easier to apply all the tricks described in my previous post

• it is great for incremental backups: adding one email to a 1000-message mbox file changes the whole file, forcing you to back it up completely. Adding one email to a Maildir means backing up only that new file
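
To make the robustness point more concrete, here is a simplified sketch of how a message gets into a Maildir: it is written under tmp first and only then renamed into new, so a reader never finds a half-written file. The function names make_maildir and deliver, the DELIVERY_COUNT counter and the file naming scheme are assumptions made for illustration. Real delivery agents build their unique names more carefully.

```shell
#!/bin/bash
# Create an empty Maildir mailbox: one directory with new, cur and tmp inside.
make_maildir() {
  mkdir -p "$1/new" "$1/cur" "$1/tmp"
}

# Copy one message file into a Maildir.
# The message is written under tmp, then renamed into new; a rename within
# the same filesystem is atomic, so readers of new only see complete files.
# The name (time, process id plus counter, host) is a simplified stand-in
# for the unique names generated by real mail software.
deliver() {
  local maildir=$1 message=$2 name
  DELIVERY_COUNT=$((DELIVERY_COUNT + 1))
  name="$(date +%s).${$}_${DELIVERY_COUNT}.${HOSTNAME:-localhost}"
  cp "$message" "$maildir/tmp/$name" &&
    mv "$maildir/tmp/$name" "$maildir/new/$name"
}
```

For example, `make_maildir /home/bob/Mail/.Work` followed by `deliver /home/bob/Mail/.Work some_message.eml` would leave one new file in .Work/new and nothing in .Work/tmp.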

OK, how do I migrate to Maildir?

If this convinces you (as I hope) to convert all your mbox files to IMAP/Maildir, the question becomes how to do it automatically, especially if those files are scattered across several directories.

As a matter of fact, there’s no need to write complicated scripts, or use esoteric libraries. All you need is an email client (that is, a program that already knows everything about mailboxes) that is capable of running inside a script, taking orders from it. Mutt is just such a client, and this is the second reason (the first is Mutt profiles) why I love it. Here are about 20 lines of code (please note the credits) that will find all your mbox files and copy their content to a Maildir:

       1  #! /bin/bash
       2  #CREDITS: inspired by: http://foolab.org/node/1737
       3
       4  for ORIG_MBOX in `find "$1" -type f -exec file {} \; | egrep 'ASCII mail|ISO-8859 mail text|UTF-8 Unicode mail text' | cut -d: -f1 `
       5  do
       6    echo "Found mbox: $ORIG_MBOX"
       7    TARGET_MAILDIR="imap://USER@SERVER/temp_email_folder"
       8    rm -f /tmp/MUTTCONF >& /dev/null
       9    cat > /tmp/MUTTCONF <<ENDMUTTCONF
      10  set folder=/dev/null
      11  set move=no
      12  set imap_pass=mypassword
      13  macro index <F3> "<tag-pattern>~A<enter><tag-prefix><copy-message>$TARGET_MAILDIR<enter>y<quit>y"
      14  folder-hook . push <F3>
      15
      16  ENDMUTTCONF
      17
      18      echo "Copying $ORIG_MBOX to $TARGET_MAILDIR"
      19      mutt -F /tmp/MUTTCONF -m Maildir -R -f "$ORIG_MBOX"
      20    done
      21  exit

Line 4 finds all the files in the directory passed as first argument ($1), runs the “file” command on them and keeps (egrep) only those whose description shows they are mailboxes. Lines 9 to 16 save to /tmp/MUTTCONF the IMAP password, user name and location plus, above all, the Mutt macro that does the real work. Line 13 (check the Mutt Manual for details) means in fact “dear Mutt, when I press the F3 key, copy all the messages in the current folder to $TARGET_MAILDIR and exit”. Line 14, instead, simulates the pressing of just that key.

Once the configuration file is available, line 19 of the script tells Mutt to use it (-F) with Maildir as default mailbox format (-m), on the $ORIG_MBOX (-f), opened in read-only mode (-R). Cool, huh?
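
Line 4 depends on the description printed by the “file” command. A rougher check that does not depend on it is to look at the first line of each file, since an mbox file starts with a “From ” separator line. This is a different technique from the one used in the script, sketched here under that single assumption. The function name looks_like_mbox is hypothetical.

```shell
#!/bin/bash
# Return success if the first line of the file starts with "From "
# (the separator line that opens every message in an mbox file).
# This is only a heuristic: it does not validate the rest of the file.
looks_like_mbox() {
  local first_line=""
  # read fails on a last line without newline, but still fills the variable
  IFS= read -r first_line < "$1" || [ -n "$first_line" ] || return 1
  case $first_line in
    "From "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Inside a loop over candidate files, `looks_like_mbox "$f" && echo "Found mbox: $f"` would report the files that pass this test.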

Usage notes

The script above will copy the content of all the mailboxes it finds in the folder passed as its first argument to $TARGET_MAILDIR. To take full advantage of it, you should note that:

• if (unlike me) you have email in less common character sets, you will have to add them in the egrep part of line 4, or the script won’t recognize them

• the $TARGET_MAILDIR may be anywhere. You may, for example, replace USER@SERVER and mypassword with the credentials of your account on any remote IMAP server (including Gmail…) to upload all the email on your drive to that server. Using 127.0.0.1 as SERVER, instead, will copy the messages to an IMAP server on your computer

• as a matter of fact, you don’t even need an IMAP server. Setting TARGET_MAILDIR to “/email/mymail_archive” would create a perfectly usable Maildir in that location

• as is, the script has no reason to create the /tmp/MUTTCONF file inside the loop. I did it there on purpose, to stress the fact that you may even move every mbox to a different Maildir, by just setting a different $TARGET_MAILDIR at every iteration. You may even use a wholly different Mutt configuration file every time, if you wanted
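
One possible way to set a different $TARGET_MAILDIR at every iteration is to derive it from the name of the mbox file. The helper below is only a sketch: the function name target_for and the archive location are assumptions, and the leading dot in the folder name simply follows the Maildir listing shown earlier.

```shell
#!/bin/bash
# Build a Maildir path from an archive root and the path of an mbox file:
#   target_for /email/mymail_archive /home/bob/Mail/Work
# prints /email/mymail_archive/.Work
target_for() {
  local archive_root=$1 mbox=$2
  printf '%s/.%s\n' "$archive_root" "$(basename "$mbox")"
}
```

In the script, line 7 could then become something like `TARGET_MAILDIR=$(target_for /email/mymail_archive "$ORIG_MBOX")`.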
