Two "todo" apps are vying for my iPhone's heart. Here's how I decided on a winner.
I like PDAs because they help me manage the things I have to do - and I'm all about the "todo" lists. I don't know if I've become dependent on lists because I have a bad memory, or if my memory is failing because I use lists for everything. Still, it is what it is.
Over the past year or so, a number of todo apps have come out for my beloved iPhone, and I've been trying most of them. It's surprising how I keep coming back to the same two apps, and equally surprising (to me) that after months of playing around with them, I still can't quite decide which one I prefer.
The two apps are Appigo's ToDo and ToodleDo for the iPhone. Both cost only a few dollars, and both are very well-rated by the public at large.
So, I figured, let's use some design analysis tools to evaluate the two apps and see what the numbers say.
I'm going to use two tools: pairwise comparison and a weighted decision matrix. These tools aren't only useful for analyzing designs - they're basic decision-making tools, and they've always done right by me when evaluating designs, conceptual or otherwise.
Both tools depend on having a good set of criteria against which the two apps will be compared. You might not know what decision to make, but you need to know how you'll know that you've made the right one. In our case here: How do I know when I've found a good todo app?
The formal term for what I'm doing here is qualitative, multi-criterion decision-making. It generally involves four tasks, which in my case are:
- Figure out criteria that apply to any "best" todo app.
- Rank the criteria by importance, because the most important criterion will affect my decision more than the others.
- Develop a rating scale to rate each app.
- Rate the apps with the rating scale and the weights.

Here are the criteria I came up with for a good todo app:
- Fast. No long delays when telling the app to do something.
- Easy. Minimal clicking - no hitting "accept" for everything, no burrowing into deeply nested forms and subforms.
- Repeats. Repeating items at regular intervals.
- Priorities. At least three levels of priority for tasks.
- Checkoff. One-touch checking off of done items.
- Backup. Easy backup (or sync) to some remote server that is fairly robust, using standard formats.
- Groups. Group items by tag or folder or project or whatever.
- Sorting. Multiple ways to sort items.
- Hotlist. Some overview page showing only near-term, important items.
- Restart. Picks up next time I run it where I left off last time (oddly, not every iPhone app does this).
- Recovery. Uncheck items that were accidentally checked off.
- Conditional deadlines. Due dates based on due dates of other items (e.g. task B is due two weeks after task A is completed).
- Links. Link an item to a folder of other items.
Next, we have to develop weights to assign relative importance to the criteria. The word relative is key here; we're not going to say that one criterion is certainly and universally more important than any other. What I want is to know how important each is with respect to the others and my own experience. Remember, one size never fits all.
This is where pairwise comparison comes in. Details on how this works are given in another web page (it ain't hard). The chart below shows just the end result. In each cell is the criterion I thought was the more important of the pair given by that cell's row and column. Since it doesn't make sense to compare something to itself, and since these comparisons are symmetric (comparing A and B is the same as comparing B and A), I only need to fill in a little less than half of the chart. If you're thinking this took a long time, you'd be wrong. It took me about 15 minutes to fill in the whole thing.
| | Fast | Easy | Repeats | Priorities | Checkoff | Backup | Groups | Sorting | Hotlist | Restart | Recovery | Cond. Deadlines | Links |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fast | - | Easy | Repeats | Priorities | Fast | Fast | Groups | Sorting | Hotlist | Fast | Fast | Cond. Deadlines | Fast |
| Easy | | - | Repeats | Priorities | Easy | Easy | Groups | Sorting | Easy | Restart | Easy | Easy | Easy |
| Repeats | | | - | Repeats | Repeats | Repeats | Repeats | Sorting | Repeats | Repeats | Repeats | Cond. Deadlines | Repeats |
| Priorities | | | | - | Priorities | Backup | Groups | Sorting | Priorities | Priorities | Recovery | Priorities | Links |
| Checkoff | | | | | - | Backup | Groups | Sorting | Hotlist | Checkoff | Checkoff | Cond. Deadlines | Links |
| Backup | | | | | | - | Backup | Sorting | Backup | Backup | Backup | Backup | Backup |
| Groups | | | | | | | - | Sorting | Hotlist | Groups | Groups | Groups | Groups |
| Sorting | | | | | | | | - | Sorting | Restart | Sorting | Sorting | Links |
| Hotlist | | | | | | | | | - | Hotlist | Hotlist | Hotlist | Hotlist |
| Restart | | | | | | | | | | - | Restart | Cond. Deadlines | Links |
| Recovery | | | | | | | | | | | - | Cond. Deadlines | Links |
| Cond. Deadlines | | | | | | | | | | | | - | Cond. Deadlines |
| Links | | | | | | | | | | | | | - |
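If you're curious how the chart turns into weights, here's a minimal sketch, assuming the simple scheme I use: each criterion's weight is its share of the wins across all 78 comparisons. The win counts below are tallied by me from the chart above, and the shares reproduce the rounded percentages that follow.

```python
# Win counts per criterion, tallied from the chart above.
# 13 criteria compared pairwise -> 13 * 12 / 2 = 78 comparisons in all.
wins = {
    "Fast": 5, "Easy": 7, "Repeats": 10, "Priorities": 6, "Checkoff": 2,
    "Backup": 8, "Groups": 8, "Sorting": 10, "Hotlist": 7, "Restart": 3,
    "Recovery": 1, "Cond. Deadlines": 6, "Links": 5,
}

total = sum(wins.values())
assert total == 13 * 12 // 2  # sanity check: every pair got decided exactly once

# Each criterion's weight is its share of the total wins.
weights = {c: n / total for c, n in wins.items()}

for c, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{c:16} {w:6.1%}")  # Repeats and Sorting top out at ~12.8%, i.e. 13%
```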
This leads to the following weights:
| Criterion | Weight |
|---|---|
| Fast | 6% |
| Easy | 9% |
| Repeats | 13% |
| Priorities | 8% |
| Checkoff | 3% |
| Backup | 10% |
| Groups | 10% |
| Sorting | 13% |
| Hotlist | 9% |
| Restart | 4% |
| Recovery | 1% |
| Cond. Deadlines | 8% |
| Links | 6% |
So this tells me that I think having repeating tasks and good sorting of items are the two most important criteria.
The point of this process is that the human mind is not good at juggling a bunch of variables, but it is very good at comparing one thing against another. Take the trivial case of choosing between three alternatives, A, B, and C. If you prefer A to B, and B to C, then you should accept the logic that A is the most preferred item. To do otherwise just isn't rational. That's exactly what pairwise comparison does. And there's good evidence that this technique actually works.
The next step is to choose a rating scale. This scale will be used to rate each app with respect to each criterion.
There's a variety of scales I could use, and a great deal of research into qualitative measurement scales has been done. The scale that works best for me - and seems to be the most general - is a five-point scale from -2 to +2, where 0 means "neutral," -2 means "horrible," +2 means "excellent," and -1 and +1 are in-between values. If you prefer something a little finer, you can use a 7-point scale from -3 to +3. I think it's important to have a zero value to indicate neutrality, and I find it meaningful to have negative numbers stand for bad things and positive numbers for good things.
In some industries (e.g. aerospace), I've noticed a tendency to use an exponential scale - something like (0, 1, 3, 9). This is because aerospace people tend to be extremely conservative (for reasons both technical and otherwise), so they tend to underrate the goodness of things. This scale inflates any reasonable rating to make up for that conservatism.
But I'm neither an aerospace engineer nor particularly conservative, so I'll use the -2 to +2 scale.
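For the record, the scale is trivial to write down. Here's a minimal encoding; the post only names the endpoints and the midpoint, so the labels for -1 and +1 are my own shorthand.

```python
# Five-point rating scale: negative = bad, zero = neutral, positive = good.
SCALE = {
    "horrible": -2,
    "poor": -1,        # an "in-between" value; this label is mine
    "neutral": 0,
    "good": +1,        # likewise my label
    "excellent": +2,
}
```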
Now we can do the weighted decision matrix. The gory details are given elsewhere. The weights come from the pairwise comparison above. In a decision matrix, we rate each alternative against some well-defined reference or base item. We need a reference because we need a fixed point against which to measure things. If we were evaluating design concepts, none of them would be suitable as references, since a "concept" design is not well-defined. In this case, we're evaluating two existing apps, so we can choose either one of them as the reference. For no particular reason, I'll use ToDo.
I worked up a weighted decision matrix comparing ToodleDo to ToDo. Here it is:
| Criterion | Weight | ToDo (reference) rating | ToDo score | ToodleDo rating | ToodleDo score |
|---|---|---|---|---|---|
| Fast | 0.06 | 0 | 0 | 0 | 0 |
| Easy | 0.09 | 0 | 0 | -1 | -0.09 |
| Repeats | 0.13 | 0 | 0 | 0 | 0 |
| Priorities | 0.08 | 0 | 0 | 0 | 0 |
| Checkoff | 0.03 | 0 | 0 | 0 | 0 |
| Backup | 0.10 | 0 | 0 | -1 | -0.10 |
| Groups | 0.10 | 0 | 0 | 0 | 0 |
| Sorting | 0.13 | 0 | 0 | +1 | +0.13 |
| Hotlist | 0.09 | 0 | 0 | +1 | +0.09 |
| Restart | 0.04 | 0 | 0 | 0 | 0 |
| Recovery | 0.01 | 0 | 0 | 0 | 0 |
| Cond. Deadlines | 0.08 | 0 | 0 | +1 | +0.08 |
| Links | 0.06 | 0 | 0 | 0 | 0 |
| **Total** | | | 0 | | +0.11 |
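The arithmetic behind that table is a straightforward weighted sum. Here's a minimal sketch: multiply each rating (relative to the ToDo reference, which scores zero by definition) by its criterion's weight, then add up the products.

```python
# Weights from the pairwise comparison above.
weights = {
    "Fast": 0.06, "Easy": 0.09, "Repeats": 0.13, "Priorities": 0.08,
    "Checkoff": 0.03, "Backup": 0.10, "Groups": 0.10, "Sorting": 0.13,
    "Hotlist": 0.09, "Restart": 0.04, "Recovery": 0.01,
    "Cond. Deadlines": 0.08, "Links": 0.06,
}

# ToodleDo's ratings relative to the ToDo reference; anything not listed is 0.
toodledo = {"Easy": -1, "Backup": -1, "Sorting": +1, "Hotlist": +1,
            "Cond. Deadlines": +1}

score = sum(w * toodledo.get(c, 0) for c, w in weights.items())
print(f"{score:+.2f}")  # +0.11: ToodleDo edges out the ToDo reference (0.00)
```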
This table might not look like much, but it tells a bit of a story. ToDo is the reference, so I've given it zeros in every category. That way, when I compare ToodleDo to it, a positive number means it beats ToDo and a negative number means it's worse than ToDo. Obviously, they're very close to one another.
If you look at the ratings for ToodleDo, you see that it's a bit better than ToDo on some points, and a bit worse on others. But the +1's don't actually cancel out the -1's because of the weights. The criteria on which ToodleDo beat ToDo are more important to me than the others, because the weights are higher. That makes ToodleDo just a little bit better than ToDo.
And that jibes nicely with my intuition. I got ToDo first, and enjoyed it. But ever since I got ToodleDo, I've preferred it. Every once in a while, I switch back to ToDo, but it never lasts very long. And up until I did this decision matrix, all I had was a vague intuition that ToodleDo was better for me; now, I actually have an explanation.
But there's a problem. ToDo handles repeating events internally; that is, when I check off the current instance of a repeating event, ToDo immediately creates the next one in the series. ToodleDo, on the other hand, generates subsequent repeating events only when you sync the app with the ToodleDo website.
This is a problem for me when I travel. I was in Berlin recently, for a conference. And I don't have a data plan for my iPhone (that's a whole separate story), so I couldn't sync either app. But that means ToodleDo couldn't roll repeating items over properly. So before I went to Berlin, I sync'd up ToDo and used it while I was gone. When I came back, though, I switched back to ToodleDo. When I go to Sweden at the end of March, I'll be using ToDo again.
Does the evaluation consider that? No, it doesn't, because I didn't. The evaluation is only as good as the evaluator. When I evaluated the two apps, I was nestled snugly at home, WiFi at the ready - and syncing either ToDo or ToodleDo was a non-issue. If I'd done the evaluation in Berlin, I'm sure I'd have gotten different numbers, because the repeating-events problem would have been right there in my face.
So this underscores a limit of the evaluation method - indeed, a limit of any method: it's only as good as the situation you're in when you use it. Some people might say a method is only as good as the information you use, but it's more than that. My situation, in this case, includes me, my goals (at the time), my experiences, all the information I have handy, constraints, and anything else that could possibly influence my decisions at the time.
The problem, then, is that a method depends on the situation in which it's used. But that situation may be different for the person doing the evaluation than for the person(s) who will have to live with the decision being made. Indeed, it's virtually guaranteed that the situations will be different, if for no other reason than that the implications of a decision only play out later.
Does this put the kibosh on these kinds of methods?
Not at all. It just means that we must be vigilant and diligent in their application. If I had done the evaluation in Berlin, ToDo would have won, because in that situation, ToodleDo would have scored poorly on repeating events. This is as it should be. It means that in the two different situations, the method worked. The problem is that in any one given situation, there's no way to take into account any other situation.
Happily, there is fruitful and vigorous research concerned exactly with this. Some people call it situated cognition; others call it situated reasoning. We've not yet figured out how to treat situations reliably, but I think it's only a matter of time before we do.
In the meantime, there is at least one other possible way to treat other situations. A popular technique to help set up a design problem is the use case (or what I call a usage scenario). These are either textual or visual descriptions of the interactions involved in using the thing you'll design. They can be quite complex and detailed. Usage scenarios try to capture a specific situation other than the one that includes the designers during the design process. So it's at least possible that usage scenarios could help designers evaluate designs and products better.
One final caveat: this evaluation is particular to me. It is unlikely that anyone will agree completely with my evaluation, because their situations are different from mine. So I'm not saying ToodleDo "is better" than ToDo. I'm just saying it seems to be better for me.
As they say: your mileage may vary.
Very nice post, albeit more of a tutorial on how to properly set up and conduct a comparative review than an actual review of the two productivity apps :o)
True. But I wanted to give readers the chance to run their own "analysis" taking into account their own interests and personal characteristics. While I came to one conclusion, there's no reason why others would reach the same conclusion running the same type of analysis.
I just read your evaluation of ToDo and ToodleDo, because I was wondering about the same things. I appreciate your articulation and expression of the subject. I feel like I'm not alone anymore; there's someone else in the world jumping back and forth between these two apps.
ReplyDeleteHow do you feel about the NEW TODO ONLINE SYNC? I love it and I prefer TODO NOW because of the ONLINE SYNC. it works pretty good.
I haven't tried the new ToDo sync service. No need. I only use it as a backup medium, and the free Toodledo service is enough for me. I know Toodledo's free service doesn't capture some relationships that ToDo can support (like checklists) but it does capture all the tasks - even tasks in checklists. And that's enough for me.
Remember, I'm a minimalist about these things.
You might enjoy other posts of mine about productivity at my dedicated blog: http://dofastandwell.blogspot.com/
Cheers.
Fil
A very good explanation. Congratulations!!!
Thanks!