Create the Corpus

Documentation

The first step is to create a corpus.txt file, which contains all the values you will search over. The function below ingests a dataframe, with each row becoming a separate entry in the corpus.

Please find an example below.

For more information about the options available on the class, see the documentation in the code-reference section.

Example

from fleming.discovery.corpus_creation import CorpusCreation
from pyspark.sql import SparkSession

# Not required if using Databricks
spark = SparkSession.builder.appName("corpus_creation").getOrCreate()

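# Load the source data; each row becomes one entry in the corpus
corpus_df = spark.read.csv("/tmp/corpus.csv", header=True, inferSchema=True)
# Location where the corpus text file will be written
corpus_file_path = "/tmp/search_corpus.txt"

corpus_creation = CorpusCreation(corpus_df, corpus_file_path)
# Assumption: the dataframe loaded above is the one to concatenate
# (the original snippet passed an undefined name, df_analytics_cleaned)
corpus = corpus_creation.concat_columns(corpus_df)
corpus_creation.write_corpus_to_file(corpus)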
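
To illustrate the idea of "one row, one corpus entry" without a Spark session, here is a minimal sketch in plain Python. It is not how Fleming implements corpus creation. The helpers `rows_to_corpus_lines` and `write_corpus`, the column names, and the space separator are all hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path


def rows_to_corpus_lines(rows, columns, sep=" "):
    """Hypothetical helper: join the selected column values of each row
    into a single corpus entry, skipping missing or empty values."""
    lines = []
    for row in rows:
        values = [str(row[c]) for c in columns if row.get(c) not in (None, "")]
        lines.append(sep.join(values))
    return lines


def write_corpus(lines, path):
    """Hypothetical helper: write one corpus entry per line."""
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Illustrative rows standing in for a dataframe (made-up values)
rows = [
    {"name": "fleming", "description": "code discovery tool"},
    {"name": "rtdip", "description": None},
]

corpus_lines = rows_to_corpus_lines(rows, ["name", "description"])

# Write to a temporary directory and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = Path(tmp_dir) / "search_corpus.txt"
    write_corpus(corpus_lines, corpus_path)
    written = corpus_path.read_text(encoding="utf-8").splitlines()
```

Each line of the resulting file corresponds to one input row, which is the property the search step relies on when it maps a matching entry back to its source row.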