CorpusTextCreation
Class to create the corpus txt file for the semantic search model from a dataframe.
The class contains the following methods:
- concat_columns: Concatenate the dataframe's columns to create the corpus. This takes all the columns in the dataframe and concatenates them into the correct format for the Fleming Frontend.
- write_corpus_to_file: Write the corpus to a file from the concatenated columns.
Example

```python
from fleming.discovery.corpus_creation import CorpusCreation
from pyspark.sql import SparkSession

# Not required if using Databricks
spark = SparkSession.builder.appName("corpus_creation").getOrCreate()

corpus_df = spark.read.csv("/tmp/corpus.csv", header=True, inferSchema=True)
corpus_file_path = "/tmp/search_corpus.txt"

corpus_creation = CorpusCreation(corpus_df, corpus_file_path)
corpus = corpus_creation.concat_columns("RepoName", "RepoLink", "RepoDescription")
corpus_creation.write_corpus_to_file(corpus)
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| spark | SparkSession | Spark Session | required |
| corpus_df | df | Source dataframe of the corpus | required |
| corpus_file_path | str | File path to write the corpus | required |
Source code in src/fleming/discovery/corpus_creation.py
concat_columns(df, item_name_column, item_link_column, item_summmary_column)
Concatenate the columns to create the corpus
Parameters:

- df (df): Cleaned dataframe
- item_name_column (str): Column containing the item name
- item_link_column (str): Column containing the item link
- item_summmary_column (str): Column containing the item summary

Returns: corpus (list): List of concatenated columns, with each string consisting of the following format:

Example: `{"Name":"Fleming","Link":"https://github.com/sede-open/Fleming","Summary":"Open-source project of the 'brain' of the AI discovery tool. Includes technical scripts to build, register, serve and query models on Databricks. Models can be run on CPU rather than GPU, providing significant cost reductions. Databricks is utilized to build and train machine learning models on the ingested data."}{"filter":{"LicenceFileContent":"Apache License 2.0","Archived":"Active"}}`
Source code in src/fleming/discovery/corpus_creation.py
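The output format above, a content JSON object immediately followed by a filter JSON object on one line, can be sketched in plain Python. This is a hypothetical illustration of the string shape only, not the actual Spark implementation; the helper name `build_corpus_entry` and its arguments are assumptions for the sketch.

```python
import json


def build_corpus_entry(name, link, summary, licence, archived):
    """Assemble one corpus string: a content object followed by a filter object."""
    # Content section: the searchable fields of the item.
    content = json.dumps({"Name": name, "Link": link, "Summary": summary})
    # Filter section: metadata used for faceted filtering in the frontend.
    filters = json.dumps({"filter": {"LicenceFileContent": licence, "Archived": archived}})
    # The two JSON objects are concatenated back to back, as in the example above.
    return content + filters


entry = build_corpus_entry(
    "Fleming",
    "https://github.com/sede-open/Fleming",
    "Open-source project of the 'brain' of the AI discovery tool.",
    "Apache License 2.0",
    "Active",
)
print(entry)
```

In the real class this construction is driven by the column names passed to `concat_columns`, one corpus string per dataframe row.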
write_corpus_to_file(corpus)
Write the corpus to a file
Parameters: corpus(list): List of concatenated columns
Returns: None
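Writing the corpus list out is a simple line-per-entry text dump. A minimal sketch, assuming the corpus is a list of strings and using a hypothetical file path:

```python
# Hypothetical corpus entries and output path for illustration.
corpus = ['{"Name": "ProjA"}', '{"Name": "ProjB"}']
corpus_file_path = "/tmp/search_corpus_sketch.txt"

# Write one corpus entry per line so the search model can stream the file.
with open(corpus_file_path, "w", encoding="utf-8") as f:
    for item in corpus:
        f.write(item + "\n")
```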