Here is a TED talk by Dr. Arthur Benjamin (of the "Mathemagic" TED talk), who argues that statistics and probability should be elevated to the status currently given to calculus.
Posted at 01:38 AM in Statistics | Permalink | Comments (0) | TrackBack (0)
I was familiar with this technique and had used it in a couple of places. This week I had a chance to apply it to a problem at hand. This time, however, I did not want to rush in. I had in my archives an old book called "A User's Guide to Principal Components", so I took some time out to learn a bit more about the technique. As always, there is some new learning in reading old stuff.
A bit of history, to begin with. This technique was first introduced by Karl Pearson in 1901 (Refer to his other contributions here). The general procedure for carrying out the technique had to wait until 1933, when Harold Hotelling introduced it in his paper. Thus the 1930s and 1940s saw a great deal of activity relating to PCA. Then things subsided and development took place as a Poisson process :). With the advent of computers, things changed. People were now able to find factors and invert matrices with ease, and thus the application of PCA became widespread in various fields. An unfortunate development was that many social scientists extended PCA to factor analysis and interpreted factors by their own whims and fancies.
The author of this book, Mr. Edward Jackson, has put in his thoughts from over 20 years of working experience at Eastman Kodak. Books from practitioners are gems, for they show how to marry theory and application and, more so, make you feel that there is no single answer to any problem. Smartness lies in applying a specific technique to a situation with an adequate mix of intuition and quantification. Another book, Super Crunchers, makes the same point.
Principal Component Analysis, abbreviated as PCA, is a data exploration technique as well as an inferential technique. The methodology is the same for populations as well as samples (a greatly comforting aspect). Also, most aspects of PCA are distribution free.
Let's say you are a police officer in charge of keeping tabs on militant threats in Mumbai. Let's put some numbers to the situation. Imagine you track 10 variables at different points in the city. You figure out that these 10 variables should stay within specific bands for you to be fairly confident that there is no militant threat in Mumbai. You have put in place police infra, informer infra, etc. in the city to get a 10-dimensional vector from every potential threat point in Mumbai, let's say 5000 points, ending up with a 5000*10 matrix of observations. All the variables can be quantified on a scale of 1-10. Let's say the first variable, x1, has a mean of 5 and bands at 4.5 and 5.5. Any observation beyond this band might make you suspicious. Let's say you keep track of all 10 variables on a big chart. So, basically, you add daily observations to your dashboard.
If all 10 variables are within the bands (let's say these are 95% confidence interval bands), would you be 95% confident that there is no militant threat?
At the outset, it might seem correct. You are tracking 95% intervals for each of the variables. However, on thinking a little, you realize that the Type I error shoots up. The reason: even when there is no threat, the probability that at least one of the 10 variables falls outside its band is 1 - (0.95*0.95*... 10 times) = 1 - 0.95^10, which is about 0.4. This means that if you control all 10 variables, there is a 40% chance that at least one of them is outside the bands!!! Now this is assuming that the variables are uncorrelated. But think about it, variables are often correlated, especially in a militant threat kind of situation. So the overall Type I error is no longer the 5% you designed for, and with correlated variables it is hard to even say what it is. What do you do? Well, there are a couple of things you can do. One is to design the bands so that the Type I and Type II errors are built into the band mechanism. Or do something which is relatively easier: do not track the variables x1, x2, x3, ..., x10 for the 5000 sites, but instead track a few linear combinations of these variables, call them X1, X2, X3, X4, which have one great property: the transformed variables are uncorrelated.
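To convince myself of the 40% figure, here is a small sketch in Python. It assumes 10 independent, standard normal variables with bands at +/-1.96 (roughly 95%); the function names and the simulation setup are mine, purely for illustration.

```python
import random


def familywise_error(alpha, k):
    """Chance that at least one of k independent variables falls
    outside its (1 - alpha) band when nothing is actually wrong."""
    return 1.0 - (1.0 - alpha) ** k


def simulate_false_alarm_rate(k=10, n_days=20000, z=1.96, seed=0):
    """Monte Carlo check: draw k independent standard normal readings
    per day and count the days on which any reading leaves +/- z."""
    rng = random.Random(seed)
    alarms = 0
    for _ in range(n_days):
        if any(abs(rng.gauss(0.0, 1.0)) > z for _ in range(k)):
            alarms += 1
    return alarms / n_days


print(round(familywise_error(0.05, 10), 4))  # 0.4013
print(simulate_false_alarm_rate())           # close to 0.40
```

The closed form only holds for independent variables. Once the variables are correlated there is no such simple formula, which is part of the motivation for moving to uncorrelated components.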
Basically this is my pathetic attempt :) at explaining PCA without any matrix notation. PCA helps you view the set of 10 variables from a different perspective, which gives a lot more clarity on the relational structure in the data. By looking at the bands and Hotelling's T-square, you can act on the realized values of the variables.
OK, now to say the same thing in matrix notation. Whatever your input matrix, let S be its covariance matrix. Then one can find an orthonormal matrix U such that U'SU = L, a diagonal matrix. The diagonal elements of L are the eigenvalues, which can also be obtained by solving |S - m*I| = 0. The characteristic vectors are obtained from (S - m*I)Xp = 0, where Xp is the characteristic vector. Thus one can transform the original data to z, where z = U'(x - xbar). (Without LaTeX support, writing math is particularly difficult on Typepad.) The L matrix is very useful because the trace of L is the sum of the original variances and the determinant |S| is just the product of the eigenvalues.
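The matrix statements above can be checked numerically. The following is a minimal sketch in Python with NumPy, not the book's procedure: the data is made up (5000 sites, 3 correlated variables instead of 10), and the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 5000 sites x 3 correlated variables (illustrative only).
n, p = 5000, 3
mixing = np.array([[1.0, 0.0, 0.0],
                   [0.8, 0.6, 0.0],
                   [0.5, 0.5, 0.7]])
x = rng.standard_normal((n, p)) @ mixing.T + 5.0

xbar = x.mean(axis=0)
S = np.cov(x, rowvar=False)          # sample covariance matrix

# eigh is meant for symmetric matrices; it returns eigenvalues in
# ascending order, so reorder to put the largest component first.
eigvals, U = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

L = U.T @ S @ U                      # diagonal, up to rounding
z = (x - xbar) @ U                   # each row is z = U'(x - xbar)

# Hotelling's T-square for each observation, from the scaled scores.
t2 = np.sum(z ** 2 / eigvals, axis=1)

print(np.allclose(L, np.diag(eigvals)))                     # U'SU = L
print(np.isclose(np.trace(L), np.trace(S)))                 # sum of variances
print(np.isclose(np.linalg.det(S), np.prod(eigvals)))       # product of roots
print(np.allclose(np.cov(z, rowvar=False), np.diag(eigvals)))  # uncorrelated z
```

All four checks should print True. The last one is the property that matters for the bands: the transformed variables z have zero covariance with each other.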
Another important thing I learnt from this book is about the stopping criterion. I was using the proportion of variance explained to decide how many components to retain. Why? I don't know... I just accepted somebody's word. I was completely dumb in not even thinking about it. However, this book revealed to me that there is nothing great about that rule. Bartlett's test provides a more principled stopping condition. This was new to me. Somehow, it is never mentioned in the usual 10,000 ft view of PCA. There are at least a dozen stopping criteria, and I guess crunching through at least half a dozen of them would bring a healthy dose of skepticism to your views.
What I could summarize above is just 5% of the book. The title is definitely an understatement. It should have been "all encompassing guide", because in whatever field you are, for whatever problem you want to use PCA, there is a guideline in the book. This is one of those good books which focus less on abstract theorems and more on practical applications.
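To see how the two rules can disagree, here is a rough sketch in Python. The Bartlett statistic is written in its common textbook form for testing whether the last p-k roots are equal, without any small-sample correction factor, which may differ in detail from the version in the book. The eigenvalues, the sample size of 101 and the function names are made up for illustration.

```python
import math


def components_for_variance(eigvals, target=0.90):
    """Smallest k whose leading eigenvalues explain at least
    `target` of the total variance (eigvals sorted descending)."""
    total = sum(eigvals)
    running = 0.0
    for k, root in enumerate(eigvals, start=1):
        running += root
        if running / total >= target:
            return k
    return len(eigvals)


def bartlett_equal_roots(eigvals, k, n):
    """Statistic and degrees of freedom for H0: the last p - k roots
    are equal, i.e. nothing but noise is left after k components.
    Compare the statistic with a chi-square critical value."""
    tail = eigvals[k:]
    q = len(tail)
    if q < 2:
        raise ValueError("need at least two trailing roots to compare")
    nu = n - 1
    mean_tail = sum(tail) / q
    stat = nu * (q * math.log(mean_tail) - sum(math.log(r) for r in tail))
    df = (q - 1) * (q + 2) // 2
    return stat, df


# Hypothetical eigenvalues from a 5-variable covariance matrix.
roots = [4.0, 2.5, 0.52, 0.50, 0.48]
print(components_for_variance(roots, 0.90))
print(bartlett_equal_roots(roots, k=2, n=101))
```

With these made-up roots, the 90% variance rule keeps four components, while the Bartlett statistic after two components is far below 11.07 (the 5% chi-square critical value for 5 degrees of freedom), so the last three roots are indistinguishable and two components would do.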
Posted at 05:06 PM in Statistics | Permalink | Comments (0) | TrackBack (0)
With easier access to raw computing power, one might think that all it takes is to analyze data and let the data throw up hypotheses.
Is theorizing over in this data-intensive world? Has the world become so complicated that theorizing is useless? Maybe not.
I read this short note (Tom Swift and His Electric Factor Analysis Machine) a couple of years back, and I reread it frequently so that I don't make the mistake of living only in an empiricist world. Some amount of a priori assumptions or hypotheses is required. In the financial world, traders provide this knowledge. It is easier to sit at a desk, crunch data and build models. However, it might be a completely useless exercise if you do not marry a priori assumptions with your empirical work.
Here is a link to this beautiful article: Tom Swift and His Electric Factor Analysis Machine
Posted at 01:05 AM in Statistics | Permalink | Comments (0) | TrackBack (0)
Parsimony is desirable but not always obtainable
- Edward Jackson (PCA Practitioner)
I guess it's sometimes true of our lives too!
Posted at 01:13 PM in Reflections | Permalink | Comments (0) | TrackBack (0)
Via NYTimes:
Robert Gentleman & Ross Ihaka (people behind R)
“R is a real demonstration of the power of collaboration, and I don’t think you could construct something like this any other way,” Mr. Ihaka said. “We could have chosen to be commercial, and we would have sold five copies of the software.”
“The great beauty of R is that you can modify it to do all sorts of things,” said Hal Varian, chief economist at Google. “And you have a lot of prepackaged stuff that’s already available, so you’re standing on the shoulders of giants.”
I guess once you try R, you will be hooked forever.
Posted at 01:53 AM in Statistics | Permalink | Comments (0) | TrackBack (0)
Oftentimes most of the data and variables that we come across appear random, and we start taking decisions based on gut. If gut works, it's great. However, before bringing in the gut, it might be sensible to stretch a bit, look at the data carefully and examine the details. Sometimes, just by changing our view, we might see patterns. Well, PCA is all about that. I don't have to go that far to make this point. For example, if one looks at the graph of sin(x) for some values of x, it looks like this:
The data points appear completely random. However, just by tweaking the aspect ratio (x:y), the same data looks like this:
I tend to believe that there are patterns everywhere and it all boils down to spending time with the data and trying to understand it carefully. However, one also needs to be skeptical about all the patterns that one finds: have a null hypothesis that the pattern is random, back test the pattern and then form an opinion. Instead of having a null hypothesis that the data is random, it is better to have a null hypothesis that the data pattern is random.
Posted at 06:19 AM in Statistics | Permalink | Comments (0) | TrackBack (0)
In God we trust; all others bring data.
—Dr. W. Edwards Deming
Posted at 10:38 PM in Reflections, Statistics | Permalink | Comments (0) | TrackBack (0)
Paul Graham on Google :
Their hypothesis seems to have been that, in the initial stages at least, all you need is good hackers: if you hire all the smartest people and put them to work on a problem where their success can be measured, you win. All the other stuff—which includes all the stuff that business schools think business consists of—you can figure out along the way.
Posted at 10:33 PM in Startup Gyan | Permalink | Comments (0) | TrackBack (0)
A Random sample of my thoughts
................................May be I will find clues to the above stuff, someday!
Posted at 10:12 PM in Finance, Statistics | Permalink | Comments (1) | TrackBack (0)
Handling large amounts of data in R is tricky, as R typically loads the entire dataset into RAM. While this means that computations are going to be very fast, it also means that the size of the dataset you can analyze is limited by your RAM. One solution I stumbled on was filehash, which seemed to offer a better way than the one I was using: getting the entire data into PostgreSQL and then doing the computations in R.
filehash seemed great, as I would not have to bring another RDBMS into the work loop. However, after spending 1-2 hours on it, I realized it was not that useful for me. It is great if you have key-value pairs, but not really great if you have many columns with dependencies and a lot of subsets that you need to work on. In any case, filehash is just another database (its own key-value store on disk) sitting behind R. Why should I put my data into yet another database when a powerful PostgreSQL is at my disposal? I have ditched filehash. There is another package called ff, but sadly it is more useful if you want to sample stuff from a big file. In summary, it appears to me that the only way out is Hadoop-R/MapReduce-R. That's a steep learning curve for me.
Why should one think of fancy systems like Hadoop?
Fact: On a typical day, 6 lakh NIFTY futures are traded in the 20,100 seconds of a trading day, meaning about 30 trades per second. Each of these 30 trades happens at a different price point and quantity. If you want to track the quantity-weighted average futures price every second for a SINGLE day, it is easy. If you want to do it for a MONTH, it might take some time. If you want to do it for a YEAR and also calculate a 5-minute moving average of the futures price, taking into consideration gap-ups and gap-downs across days, we are talking about lots of computations on lots of data points. Even in such a simple application, using RDBMS systems is a PAIN. I am guilty of using them now, but it will be of no use as things scale.
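For a single day the per-second computation really is simple. Here is a minimal sketch in plain Python; the (second, price, quantity) tuple layout, the function names and the sample trades are my own assumptions, not an exchange data format.

```python
from collections import defaultdict, deque


def vwap_per_second(trades):
    """trades: iterable of (second, price, quantity).
    Returns {second: quantity-weighted average price}."""
    value = defaultdict(float)
    volume = defaultdict(float)
    for second, price, qty in trades:
        value[second] += price * qty
        volume[second] += qty
    return {s: value[s] / volume[s] for s in sorted(volume)}


def moving_average(per_second, window=300):
    """Trailing mean of the per-second prices over the last `window`
    seconds (300 s = 5 minutes). Seconds without trades are skipped."""
    recent = deque()
    total = 0.0
    out = {}
    for second in sorted(per_second):
        value = per_second[second]
        recent.append((second, value))
        total += value
        # Drop entries that have fallen out of the trailing window.
        while second - recent[0][0] >= window:
            _, old_value = recent.popleft()
            total -= old_value
        out[second] = total / len(recent)
    return out


# Rough load: 6 lakh trades over 20,100 seconds.
print(600_000 / 20_100)

# Made-up trades: (second of day, price, quantity).
trades = [(0, 100.0, 2), (0, 101.0, 1), (1, 102.0, 3), (2, 99.0, 1)]
prices = vwap_per_second(trades)
print(prices)
print(moving_average(prices, window=2))
```

This sketch ignores gap-ups and gap-downs across days. One simple way to handle them would be to run it separately for each day so that the moving average restarts at the open, but the painful part is still doing this for a year of data that does not fit in RAM.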
Posted at 07:19 PM in Technology | Permalink | Comments (0) | TrackBack (0)
Posted at 02:59 PM in Reflections | Permalink | Comments (0) | TrackBack (0)
Posted at 01:15 PM in Economy | Permalink | Comments (0) | TrackBack (0)