Apache Solr | Finding term frequency of a specific term in Solr for indexed Documents

By Webner

Document frequency and term frequency are two measures commonly used to determine how relevant documents are to a search query. When a user searches for a word, the number of documents that contain the word is known as the document frequency. Similarly, the number of times the word appears in a given document is known as the term frequency. Based on this information, results can be sorted to present the user with the most relevant documents first.
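To make the two definitions concrete, here is a minimal sketch in plain Java that counts both values over a tiny in-memory list of strings. The class name, the sample sentences, and the simple tokenizer (lower-casing and splitting on non-alphanumeric characters) are assumptions made for illustration; Solr applies its own analysis chain to a field, so its counts may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Solr: illustrates the two definitions only.
final class TfDfSketch {

    // Term frequency: how many tokens of the text equal the term (case-insensitive).
    static int termFrequency(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        // Split on anything that is not a letter or a digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    // Document frequency: how many documents contain the term at least once.
    static int documentFrequency(List<String> docs, String term) {
        int df = 0;
        for (String doc : docs) {
            if (termFrequency(doc, term) > 0) {
                df++;
            }
        }
        return df;
    }

    public static void main(String[] args) {
        // Made-up sample documents.
        List<String> docs = Arrays.asList(
                "Solr is a search server. Solr builds on Lucene.",
                "Lucene is a Java library.",
                "Indexing documents in Solr");

        System.out.println("Document frequency of 'solr': " + documentFrequency(docs, "solr"));
        for (int i = 0; i < docs.size(); i++) {
            System.out.println("Term frequency in document " + (i + 1) + ": "
                    + termFrequency(docs.get(i), "solr"));
        }
    }
}
```

With these sample strings, two of the three documents contain the word, and the first one contains it twice.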

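In Apache Solr, the URL below computes the term frequency of the word ‘solr’ in the indexed documents. The fl parameter returns the termfreq() value alongside each document id, and the sort parameter orders the results by that value: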

http://sorlurl.com/solr/booksdetails/select?q=solr&fl=id,termfreq%28%22attr_content%22,%22solr%22%29&sort=termfreq%28%22attr_content%22,%22solr%22%29%20desc

Results are sorted in descending order of term frequency, i.e. the document with the highest term frequency of the word ‘solr’ appears at the top and the document with the lowest term frequency at the bottom:
[Screenshot: query results sorted by term frequency]
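The percent-encoded parts of such a URL (%28, %22, %29, %20) are easier to get right when generated than when typed by hand. The following is a rough sketch using only the JDK; the class and method names are hypothetical, and the base URL is assumed to be a local core named booksdetails. Note that URLEncoder also encodes the comma as %2C and the space as +, which Solr should decode to the same values as the hand-written form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: assembles a select URL that returns and sorts by termfreq().
final class TermFreqUrlSketch {

    // Builds the function query text, e.g. termfreq("attr_content","solr").
    static String termfreq(String field, String term) {
        return "termfreq(\"" + field + "\",\"" + term + "\")";
    }

    // Percent-encodes a single query-parameter value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    static String buildUrl(String coreUrl, String field, String term) {
        String fn = termfreq(field, term);
        return coreUrl + "/select"
                + "?q=" + enc(term)
                + "&fl=" + enc("id," + fn)
                + "&sort=" + enc(fn + " desc");
    }

    public static void main(String[] args) {
        // Assumed local core URL; replace with the address of your own Solr core.
        System.out.println(buildUrl("http://localhost:8983/solr/booksdetails", "attr_content", "solr"));
    }
}
```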
The same query can be entered in the Solr interface by filling in the corresponding query fields, as below:
[Screenshot: the query entered in the Solr interface]
In Java, term frequency can be calculated with the following program, which connects to Solr through the SolrJ client:

package solr.com;

import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.ORDER;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class TermFreq {
    public static void main(String[] args) throws SolrServerException, IOException {
        String urlString = "http://localhost:8983/solr/booksdetails";
        // SolrClient is Closeable, so try-with-resources releases the connection.
        try (SolrClient server = new HttpSolrClient.Builder(urlString).build()) {
            SolrQuery qterms = new SolrQuery();
            qterms.setQuery("solr");
            // Restrict the results to documents whose attr_content field contains 'solr'.
            qterms.setFilterQueries("attr_content:solr");
            // Return the id and the term frequency under the alias 'Terms'.
            qterms.setFields("id", "Terms:termfreq(\"attr_content\",\"solr\")");
            // Highest term frequency first.
            qterms.addSort("termfreq(\"attr_content\",\"solr\")", ORDER.desc);

            QueryResponse response = server.query(qterms);
            SolrDocumentList results = response.getResults();
            long totalhits = results.getNumFound();
            System.out.println("Number of documents found (Document frequency):" + totalhits + "\n");

            int i = 1;
            for (SolrDocument doc : results) {
                System.out.println("Output for Document " + i++ + ":");
                System.out.println("-----------------------------------------------");
                System.out.println("Document ID:" + doc.getFieldValue("id"));
                System.out.println("Number of Terms (Term frequency):" + doc.getFieldValue("Terms") + "\n\n");
            }
        }
    } //end of main()
} //end of class

Here, the ‘totalhits’ variable holds the number of documents that contain the term ‘solr’. The termfreq("Field_name","Term_name") function returns the number of times the specified term occurs in the given field of each document. In this program, ‘solr’ is found in 4 documents, each with a different term frequency, and the results are sorted in descending order of that value. ‘Terms’ is an alias given to the termfreq() function in the field list. The alias is optional; it lets the function value be retrieved by name instead of writing out the complete function again.

Output of the above program in the console:
[Screenshot: console output of the program]
From the output, it can be seen that the document with id=7 contains 247 occurrences of the term ‘solr’, which is the highest value among all the documents containing this term. Next comes the document with id=3 and tf=108, then the document with id=8 and tf=86, and last the document with id=1 and tf=60.
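The ordering described above is what sort=termfreq(...) desc produces on the server. As a small illustration, the same ordering can be reproduced on the client side with a comparator. This is only a sketch: the Hit class and the method name are hypothetical, and the four id and tf pairs are the ones reported in the output above, fed in a deliberately shuffled order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical client-side illustration of sorting hits by term frequency.
final class TfRankingSketch {

    // Minimal holder for a document id and its term frequency.
    static final class Hit {
        final String id;
        final int tf;

        Hit(String id, int tf) {
            this.id = id;
            this.tf = tf;
        }
    }

    // Returns a new list ordered by term frequency, highest first.
    static List<Hit> sortByTfDesc(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Hit h) -> h.tf).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Values taken from the output discussed above, in arbitrary order.
        List<Hit> hits = Arrays.asList(
                new Hit("1", 60), new Hit("7", 247), new Hit("8", 86), new Hit("3", 108));

        for (Hit h : sortByTfDesc(hits)) {
            System.out.println("id=" + h.id + " tf=" + h.tf);
        }
    }
}
```

In practice it is usually preferable to let Solr do the sorting, as the program above does, since the server can sort before paging the results.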

This article gives only a basic picture of how search results can be retrieved and ordered for user queries. Relevance functions used in practice are more complex than the one presented here.
