LSA

More Information
The names Latent Semantic Analysis (LSA) and Latent Semantic Indexing (LSI) are frequently used interchangeably because they rest on the same underlying theory and processes. The distinction is one of application, since the same mathematics and computation are employed for both. LSA may be considered to refer to a broad collection of applications, while LSI is more closely associated with information retrieval. We use the name LSA here because it indicates the wider scope.
Latent Semantic Analysis (LSA) is a fully automatic technique for extracting and inferring the meaning of terms and documents. To obtain the "latent" structure of a document collection, LSA begins with a large collection of machine-readable text that consists of multiple documents. Documents are units of text of a predetermined size, such as sentences, paragraphs, collections of paragraphs, book chapters, or whole books, depending on the application. These documents are parsed into terms, which are the individual components that make up a document. Typically, terms are words, but they can also be phrases or concepts, again depending on the application. The terms and documents are organized into a single large matrix, which is then processed using the singular value decomposition (SVD).
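As a rough illustration of the steps just described, the following Python sketch builds a term-by-document count matrix from a toy collection and applies a truncated SVD with NumPy. The sample documents, the whitespace tokenizer, the helper names (build_term_document_matrix, truncated_svd), and the choice of k = 2 are all assumptions made for this example. Practical systems usually also apply term weighting and use far more dimensions.

```python
import numpy as np

# Toy collection: each string is treated as one "document" (illustrative only).
docs = [
    "human machine interface for computer applications",
    "survey of user opinion of computer system response time",
    "graph minors and trees",
    "intersection graph of paths in trees",
]


def build_term_document_matrix(documents):
    """Return (sorted vocabulary, terms-by-documents matrix of raw counts)."""
    # Hypothetical tokenizer: lowercase and split on whitespace.
    tokenized = [d.lower().split() for d in documents]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(documents)))
    for j, doc in enumerate(tokenized):
        for t in doc:
            A[index[t], j] += 1.0
    return vocab, A


def truncated_svd(A, k):
    """Keep the k largest singular values and their singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


vocab, A = build_term_document_matrix(docs)
U_k, s_k, Vt_k = truncated_svd(A, k=2)

print("matrix shape (terms x documents):", A.shape)
print("retained singular values:", s_k)
```

Each row of U_k corresponds to a term and each column of Vt_k to a document. Keeping only k dimensions is what gives the reduced space in which terms and documents are later compared.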
The SVD produces a dense, multi-dimensional vector-space representation of the document collection (an LSA space) containing vectors corresponding to the terms and documents of the collection. Within this "semantic space", the meaning of a term is represented as the average effect that it has on the meaning of the documents in which it occurs. Similarly, the meaning of a document is represented as the sum of the effects of all the terms it contains. Each term and each document is represented by a single vector, and every vector has the same number of elements, one for each dimension of the LSA space. In this LSA space, the closeness of terms and documents can be determined by examining the position and proximity of the term and document vectors. The position of terms and documents in the vector space serves as a form of "semantic indexing". Terms close to one another in the LSA space are considered to have similar meaning regardless of whether they appear in the same document. Likewise, documents are identified as similar to each other if they lie close together in the LSA space, regardless of the specific words they contain.
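To make the idea of proximity concrete, here is a minimal Python sketch that compares term vectors by cosine similarity, one common proximity measure (the text above does not prescribe a specific one). The five terms, the four documents, the hand-built count matrix, the scaling of term vectors by the singular values, and k = 2 are assumptions chosen for illustration. In this toy collection "car" and "automobile" never occur in the same document, but both occur alongside "engine".

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]

# Columns are documents: d0 = "car engine", d1 = "automobile engine",
# d2 = "flower petal", d3 = "flower". Entries are raw term counts.
A = np.array([
    [1.0, 0.0, 0.0, 0.0],  # car
    [0.0, 1.0, 0.0, 0.0],  # automobile
    [1.0, 1.0, 0.0, 0.0],  # engine
    [0.0, 0.0, 1.0, 1.0],  # flower
    [0.0, 0.0, 1.0, 0.0],  # petal
])


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return float(u @ v / (nu * nv))


k = 2  # number of retained dimensions (an assumption for this toy case)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vecs = U[:, :k] * s[:k]  # one row per term in the reduced space

car, automobile, flower = (terms.index(t) for t in ("car", "automobile", "flower"))

# In the raw matrix the two rows share no document, so their cosine is 0.
raw_sim = cosine(A[car], A[automobile])
# In the reduced space the two terms end up pointing in the same direction.
lsa_sim = cosine(term_vecs[car], term_vecs[automobile])
unrelated_sim = cosine(term_vecs[car], term_vecs[flower])

print("raw car/automobile:", raw_sim)
print("LSA car/automobile:", lsa_sim)
print("LSA car/flower:", unrelated_sim)
```

With this particular matrix the reduced-space similarity of "car" and "automobile" should come out close to 1, while "car" and "flower" stay close to 0. Real collections are far noisier, so similarities are rarely this clean.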
In information retrieval applications, users' queries are projected into the LSA space, and documents in the neighborhood of the projected query are returned in ranked order. The nearby documents need not share any terms with the user's query, since the location of the query in the vector space is determined by the underlying semantic structure of the document collection. In other analysis applications, the entire content of the document collection can be analyzed by comparing the relative proximity of all terms or all documents with each other.
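The query step can be sketched in Python as follows. This example uses the commonly described "fold-in" projection, in which a query's term-count vector q is mapped to q_hat = inverse(S_k) * transpose(U_k) * q and then compared with the document coordinates by cosine similarity. The text above does not specify the projection or ranking formula, so treat this as one plausible variant. The toy matrix, the helper names (project_query, rank_documents), and k = 2 are assumptions.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]

# Columns are documents: d0 = "car engine", d1 = "automobile engine",
# d2 = "flower petal", d3 = "flower". Entries are raw term counts.
A = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

k = 2  # number of retained dimensions (an assumption for this toy case)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
doc_vecs = Vt[:k, :].T  # one row per document in the reduced space


def project_query(query_terms):
    """Fold a bag-of-words query into the reduced space."""
    q = np.zeros(len(terms))
    for t in query_terms:
        if t in terms:  # terms outside the vocabulary are ignored
            q[terms.index(t)] += 1.0
    return (U_k.T @ q) / s_k


def rank_documents(query_terms):
    """Return (document indices by descending cosine similarity, scores)."""
    q_hat = project_query(query_terms)
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat)
    safe = np.where(norms > 0, norms, 1.0)  # avoid division by zero
    scores = np.where(norms > 0, doc_vecs @ q_hat / safe, 0.0)
    return np.argsort(-scores), scores


order, scores = rank_documents(["automobile"])
print("ranking:", order)
print("scores:", scores)
```

In this toy case document d0 ("car engine") scores as highly as d1 even though it does not contain the query term "automobile", which mirrors the point that nearby documents need not share any terms with the query.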
Contact us to find out how we can help you utilize LSA technology in your application domain.