Please note that I am a total beginner with machine learning and artificial intelligence and also a novice with Python (I'm sure I have a very non-Pythonic way of writing code).
I have a project in which I have to develop a CV recommender system. The system will be specialized in finding suitable candidates for IT job positions. Because of this domain constraint, I have implemented a specialized stop-word removal function for the IT domain (more on it below).
I'm developing the application in Python 2.7.6 + sklearn 0.15.2 and so far my progress is the following.
I know this:
I have a folder containing multiple resumes stored as .PDF files.
To begin with, I understand (or so I have read) that any machine learning application has 3 main phases of development.
So, for starters, I have written my own implementation of a "bag of words" approach that reads multiple resumes stored as .pdf files. The contents of each file are passed through a pre-processing stage where I eliminate English stop-words.
I have slightly modified the usual check of whether a word is present in the English dictionary, so that the elimination process does not filter out "words" that are technologies in the programming world (C#, C++, F#, etc.) and would otherwise be considered non-words. Below is the part of the code that does this stop-word removal.

    from re import sub

    # get_file_text and filter_stopwords are defined elsewhere in the project.


    def get_words(documents, remove_stopwords):
        document_words = []
        for resume in documents:
            document_words.append(get_bag_of_words(resume))
        if remove_stopwords:
            document_words = filter_stopwords(document_words)
        cleaned_resume_words = special_character_cleanup(document_words)
        return cleaned_resume_words


    def get_bag_of_words(file_name, pos_tagging=None):
        if pos_tagging is None:
            file_text = get_file_text(file_name)
            words = file_text.split(' ')
            words = filter(lambda word: word != "", words)
            return words
        else:
            return "POS tagging version not implemented yet"


    def special_character_cleanup(text):
        # Removes both whitespaces and special characters
        # LOGIC : If the word which is checked has a length larger than 0 after all
        # special characters have been removed then it means that it is a word and
        # that the special characters confer it a meaning. We continue and proceed to
        # appending the word after it is cleared from the additional characters except
        # the trailing ones in the regular expression definition : +#-
        cleaned_words_array = []
        for sub_array in range(0, len(text)):
            cleaned_words_array.append([])
            for word in text[sub_array]:
                if len(sub(r'[^a-zA-Z]+', '', word)) > 0:
                    if not word.endswith("."):
                        cleaned_words_array[sub_array].append(sub(r'[^a-zA-Z0-9 .+#-]+', '', word))
                    else:
                        cleaned_words_array[sub_array].append(sub(r'[^a-zA-Z0-9 .+#-]+', '', word[:-1]))
        return cleaned_words_array

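The stop-word filter itself is not shown above, so here is a minimal sketch of the idea only: a word is kept if it is on an allow-list of technology names, or if it is a dictionary word that is not a stop-word. The three word sets and the name `filter_stopwords_sketch` are hypothetical placeholders for illustration, not the code from the project.

```python
# Hypothetical stand-ins for a real English dictionary, a stop-word list,
# and an allow-list of technology names that must never be filtered out.
ENGLISH_WORDS = {"i", "have", "experience", "with", "and"}
STOPWORDS = {"i", "have", "with", "and"}
TECH_TERMS = {"c#", "c++", "f#"}


def filter_stopwords_sketch(document_words):
    """Keep technology names, plus dictionary words that are not stop-words."""
    filtered = []
    for words in document_words:
        kept = []
        for word in words:
            lowered = word.lower()
            is_tech = lowered in TECH_TERMS
            is_content_word = lowered in ENGLISH_WORDS and lowered not in STOPWORDS
            if is_tech or is_content_word:
                kept.append(word)
        filtered.append(kept)
    return filtered


docs = [["I", "have", "experience", "with", "C#", "and", "C++", "xyzzy"]]
result = filter_stopwords_sketch(docs)
print(result)
```

Checking the allow-list first means that tokens such as C# survive even though they would fail the dictionary lookup.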
After this pre-processing phase I'm unsure how to proceed to the model training phase.
I generally build an array with a vocabulary that contains all distinct words found in all of my resumes (after they have been passed through the above pre-processing phase) and then consider this my "feature vector".

    # File paths for resumes and job description
    resumesPath = '/home/radu/MLData/resume_examples/Smith/'

    # Generating the resume and job description file names from their corresponding paths
    resumes = get_files(resumesPath)

    # Extracting the feature vectors from each of the files (resumes and job descriptions)
    remove_stopwords = True
    resumes_data = get_words(resumes, remove_stopwords)

    print "\nResumes vocabulary: " + str(sum(len(array) for array in resumes_data))
    print sorted(generate_vocabulary(resumes_data))

------------------

    def generate_vocabulary(texts):
        vocabulary = []
        for text in texts:
            for word in text:
                if word not in vocabulary:
                    vocabulary.append(word)
        return vocabulary

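As a side note on the vocabulary step: checking `word not in vocabulary` against a list gets slower as the vocabulary grows. A possible variant, sketched here under the assumption that the input is a list of token lists like `resumes_data`, tracks seen words in a set while keeping first-seen order. The name `generate_vocabulary_fast` and the sample data are made up for illustration.

```python
def generate_vocabulary_fast(texts):
    """Collect distinct words in first-seen order, using a set for fast membership tests."""
    seen = set()
    vocabulary = []
    for text in texts:
        for word in text:
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


# Hypothetical pre-processed resumes (lists of tokens).
sample = [["python", "sql", "python"], ["sql", "c#"]]
sample_vocabulary = generate_vocabulary_fast(sample)
print(sample_vocabulary)
```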
Based on this vocabulary I proceed to train a very simple `GaussianNB` model and then simply predict the class of a test instance. The model is trained with a `frequency` array, which is a two-dimensional array (an array of arrays). The `frequency` array contains the number of times each word in the vocabulary appears in each of the resume files. I also give the model a `classes` array, which is hard-coded so far.

    from sklearn.naive_bayes import GaussianNB

    classifier = GaussianNB()
    classifier.fit(frequency, classes)
    print(classifier.predict(c_test_data))

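The frequency array described above (one row per resume, one column per vocabulary word) could be built along these lines. This is only a sketch: `build_frequency_matrix` and the sample tokens are hypothetical, and the vocabulary is simply the sorted set of all tokens.

```python
from collections import Counter


def build_frequency_matrix(texts, vocabulary):
    """Return one row per document; each row holds the count of every vocabulary word."""
    matrix = []
    for words in texts:
        counts = Counter(words)  # missing words count as 0
        matrix.append([counts[term] for term in vocabulary])
    return matrix


# Hypothetical, already pre-processed resumes (lists of tokens).
example_resumes = [
    ["python", "java", "python", "sql"],
    ["c#", "sql", "c#", "c#"],
]
example_vocabulary = sorted(set(word for text in example_resumes for word in text))
example_frequency = build_frequency_matrix(example_resumes, example_vocabulary)

print(example_vocabulary)
print(example_frequency)
```

Each row has the same length as the vocabulary, which is the shape a classifier's `fit` method expects for its training data.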
I need help with:
Now, I am unsure how to proceed further. I'm not sure what the correct path to follow next is, or even whether there is such a thing.
I would like you to help me understand what I could do next to improve my process, and which metrics I could use to validate my model.
Also, things are not entirely clear in my mind: is this, so far, a good implementation of a "bag of words" model, or is it a terrible hack'n'slash of different principles from "bag of words"?
Thank you in advance for reading this post and for any tips on how to proceed.
Also, if you want to browse my super-dirty code, it is on Github - the "root" part of this code can be found under