# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [1.0.2] - 2021-06-20 (Marcel Jerzyk)

- Added the research paper "Evaluating the Potential of a Candidate for a Job Offer Based on his GitHub Profile" in `.pdf` format (here).
- Updated `README.md` with information about the `ResearchPaper.pdf`.
- Removed the old `LANG.md` file from `./docs`.
## [1.0.1] - 2021-06-20 (Marcel Jerzyk)

- Added one more idea to the Further Research part of the `conclusions.tex` section.
- Added the Report #2 file (`report_2_for_group_m2.tex`) to the `archive` directory.
- Added missing paper images to the repository.
## [1.0.0] - 2021-06-20 (Marcel Jerzyk)

Final release.

- Applied some final formatting to the paper, like page breaks where they make the article look better.
- Added self-defined formatting for code listings so they look better and highlight keywords, comments and strings properly.
- Updated `abstract.tex` with the results.
- Added the final sample size used in the algorithms (`creating_ml_model.tex`).
- Added an explanation of what the R script does to `data_collection.tex`.
- Removed an unused reference.
- Moved older versions of sections to the `paper/misc/archive` directory.
## [0.11.2] - 2021-06-17 (Marcel Jerzyk)

- For privacy, removed the GitHub links and usernames from data files.
## [0.11.1] - 2021-06-17 (Marcel Jerzyk)

- Changed file names to be less ambiguous.
- Rebased; added changelog entries.
- Removed the empty `reproduction` directory.
- Removed the token from `repo_data.r`.
## [0.11.0] - 2021-06-17 (Jakub Litkowski)

- Added a new `extract_data_from_git.r` script.
- Added a new `cleaned_data_with_ql.csv` file.
## [0.10.1] - 2021-06-17 (Marcel Jerzyk)

- Moved `model_script.r` from `./reproduction` into the `./src/gitprofiler/r_scripts/` directory.
- Moved & renamed `model_data_no_labels.csv` from `./reproduction` into the `./data/` directory.
- Rebased, added changelog entries, and tweaked & updated `./README.md` regarding reproduction.
## [0.10.0] - 2021-06-17 (Jakub Szałca)

- Added `model_script.R` for reproduction purposes.
- Added `modelDataNoLabels.csv` for reproduction purposes.
## [0.9.0] - 2021-06-17 (Marcel Jerzyk)

- `conclusions.tex` section written & translated.
- `discussion.tex` section written & translated.
- `results.tex` section written & translated.
- Conclusion written in the `abstract.tex` section.
- Added `url` formatting in the paper.
- Cropped the image in the systematic review section and adjusted its width.
## [0.8.1b] - 2021-06-15 (Marcel Jerzyk & Jakub Litkowski)

- Updated `README.md` with reproduction instructions for groups `M2`, `M3` & `M4`.
## [0.8.1a] - 2021-06-13 (Marcel Jerzyk)

- Added `pandas`, `numpy`, `requests` and `gitpython` as required packages in `requirements.txt`.
## [0.8.1] - 2021-06-13 (Marcel Jerzyk)

- In `README.md`: added information about the paper directory.
- Moved `/literature/` under the `./data` directory.
- Fixed spelling in `CHANGELOG.md`.
## [0.8.0] - 2021-06-13 (Marcel Jerzyk)

Further work on the paper:

- Created the Creating Machine Learning Model section.
- First results are available in the Results section.
- Re-checked the whole paper via a grammar checker and adjusted it accordingly.
- Gave the whole paper a "lifting" via improved formatting.
- Added labels to sections so they can be cross-referenced, for an improved reading experience.
- Improved the language and added content in the Data Processing section,
  - among others, details regarding the manual pre-processing of the questionnaire.
- Added a Data section in `README.md`.
## [0.7.2] - 2021-06-13 (Marcel Jerzyk)

- Added the used literature as `.pdf` files in the `./data/literature/` directory.
## [0.7.1] - 2021-06-04 (Marcel Jerzyk)

- Retranslated the Literature Review into formal "paper" English.
## [0.7.0] - 2021-06-02 (Marcel Jerzyk)

- New script: `.../merge_jsons.py` was added to the project.
  - This script takes on input: `str: github_username`.
  - Output is stored (by default) in:
  - The purpose of this script is to merge all of the `.json` files that are generated via the `.../scan_repositories.py` script into one `.json` file that contains the merged information from all of them, in a "gather" mode. This means that when information fetched for a given language-linter combo is present in some other file, the results are summed up. This allows even easier usage of the Mega Linter information in the ML model, because the processing work is already done.
  - The script also takes care of the different kinds of `.json`s, as some of them are linter-related while others contain aggregated results.
  - The structure of the file is as follows:

    ```json
    {
      "<lang>": {
        "<linter>": { "errors": int, "files": int, "fixed": int },
        "total": {
          "clones": int,
          "duplicate_lines_num": int,
          "duplicate_tokens_num": int,
          "files": int,
          "lines": int,
          "tokens": int
        }
      }
    }
    ```

    - There is a special case for `<lang>`, which is `"Total:"`; it contains all the aggregated results.
    - The `"total"` subkey doesn't always have to exist; for example, it does not for `<lang>`: `cspell`, `xml`, `yaml`.
  - Example final output:

    ```json
    {
      "Total:": {
        "total": {
          "clones": 163,
          "duplicate_lines_num": 6061,
          "duplicate_tokens_num": 52202,
          "files": 428,
          "lines": 44590,
          "tokens": 473152
        }
      },
      "java": {
        "checkstyle": { "errors": 108, "files": 109, "fixed": 0 },
        "total": {
          "clones": 56,
          "duplicate_lines_num": 948,
          "duplicate_tokens_num": 10656,
          "files": 106,
          "lines": 11093,
          "tokens": 106282
        }
      },
      "python": {
        "black": { "errors": 59, "files": 62, "fixed": 0 },
        "flake8": { "errors": 2438, "files": 62, "fixed": 0 },
        "isort": { "errors": 37, "files": 62, "fixed": 0 },
        "pylint": { "errors": 2, "files": 62, "fixed": 0 },
        "total": {
          "clones": 40,
          "duplicate_lines_num": 798,
          "duplicate_tokens_num": 8403,
          "files": 58,
          "lines": 6111,
          "tokens": 57864
        }
      },
      "spell": {
        "cspell": { "errors": 9555, "files": 640, "fixed": 0 },
        "misspell": { "errors": 18, "files": 640, "fixed": 0 }
      }
    }
    ```
- Added the `paper` directory.
  - It contains the LaTeX paper, which got a complete overhaul:
    - Redundant text and formatting were removed.
    - Pointless comments were removed and the existing comments were standardized.
    - Sections of the paper were moved to separate files.
    - Heading/title sections were adjusted.
    - A few errors and warnings were resolved.
    - Created the `/img/` directory for images.
    - Created the `/misc/` folder for all the material that is not tightly related to the paper or its structure.
- The `.../scan_repositories.py` script now fetches the repositories of a given user automatically (no need to provide a separate repositories list in order to fetch them).
  - New input: `str: github_username`.
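The "gather" merge that `merge_jsons.py` performs can be sketched as a recursive dictionary merge that sums numeric leaves. This is an illustrative reimplementation under assumed names (`gather`, `merge_json_files`), not the script's actual code:

```python
import json
from pathlib import Path


def gather(a: dict, b: dict) -> dict:
    """Recursively merge `b` into `a`, summing numeric leaves.

    This mirrors the "gather" mode described above: when the same
    language-linter combo appears in both inputs, the counts
    (errors, files, lines, ...) are added together.
    """
    out = dict(a)
    for key, value in b.items():
        if key in out and isinstance(out[key], dict) and isinstance(value, dict):
            out[key] = gather(out[key], value)
        elif key in out and isinstance(out[key], (int, float)):
            out[key] = out[key] + value
        else:
            out[key] = value
    return out


def merge_json_files(paths) -> dict:
    """Fold every per-repository `.json` file into one merged dict."""
    merged: dict = {}
    for path in paths:
        merged = gather(merged, json.loads(Path(path).read_text()))
    return merged
```

With the example output above, merging two such files would, for instance, add their `flake8` error counts under the `"python"` key.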
## [0.6.0] - 2021-05-25 (Marcel Jerzyk)

- New script: `.../scan_repositories.py` was added to the project.
  - This script takes on input: `str: github username`, `list[str]: github repositories names`.
  - It automates the repository cloning, Mega Linter linting and Mega Linter scraper scraping routine. It does exactly what intuition might suggest:
    - first it runs `git clone` for every single repository name given on input,
    - then it uses Mega Linter to parse the repositories' contents and generate an output file,
    - and at the end it uses the previously made `.../scraper.py` to parse the file contents into a machine-readable `.json` format.
  - The output results are stored in `/data/repositories/<username>` *(the directory will be automatically created if it's not present yet)*.
  - The script will also try to clean the repositories directory of directories generated by earlier launches.
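The clone-and-lint routine above could be sketched roughly as follows. The repository URL scheme, the output-file naming, and the `npx mega-linter-runner` invocation are assumptions for illustration, not the project's actual code:

```python
import subprocess
from pathlib import Path


def scan_repositories(username: str, repo_names: list,
                      out_dir: str = "data/repositories") -> Path:
    """Clone each repository and capture a linter log for it (sketch)."""
    target = Path(out_dir) / username
    # Create /data/repositories/<username> if it is not present yet.
    target.mkdir(parents=True, exist_ok=True)
    for name in repo_names:
        repo_dir = target / name
        # 1. Clone the repository (URL scheme assumed here).
        subprocess.run(
            ["git", "clone",
             f"https://github.com/{username}/{name}.git", str(repo_dir)],
            check=True,
        )
        # 2. Run Mega Linter in the repository directory and capture its
        #    stdout, mimicking the `> output.txt` redirection; the exact
        #    CLI invocation is an assumption.
        with open(target / f"{name}_output.txt", "w") as log:
            subprocess.run(["npx", "mega-linter-runner"],
                           cwd=repo_dir, stdout=log, check=False)
    return target
```

The `.../scraper.py` step is omitted in this sketch; it would be run on each captured log afterwards.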
## [0.5.0] - 2021-05-11 (Marcel Jerzyk)

- New directory: `./docs`, containing various files regarding the technical side of the project, as well as images used in markdown files.
- LaTeX document changes tracker: `LANGv1.md`. It has previous versions of sections and subsections:
  - Systematic Review
  - Research Questions
  - Resources to Be Searched
  - Results Selection Process
  - Data Collection
  - Data Pre-processing
- Moved directory: `./img/readme/*` inside `./docs` (now: `./docs/img/readme/*`).
## [0.4.0] - 2021-05-11 (Marcel Jerzyk)

This changelog entry will be filled in a few days.
## [0.3.1] - 2021-05-08 (Marcel Jerzyk)

- Logged in this file the Version History for `0.3.1`.
- Logged in this file the Version History for `0.3.0`.
- Logged in this file the Version History for `0.2.1`.
- Logged in this file the Version History for `0.2.0`.
- Logged in this file the Version History for `0.1.0`.
- Logged in this file the Version History for `0.0.1`.
- Updated the Version Number in `README.md`.
## [0.3.0] - 2021-05-08 (Marcel Jerzyk)

- New script: `scrape.py`.
  - The script takes as input an output file, which can be generated via Mega Linter by redirecting the standard output stream into a text file (`> output.txt`).
  - The script parses the log data and scrapes the duplicates table information into a dictionary:

    ```python
    {
        "language": str,
        "files": int,    # number of files detected in the given language by the linter
        "lines": int,    # number of lines detected in the given language
        "tokens": int,   # number of tokens ("chars") detected in the given language
        "clones": int,
        "duplicate_lines_num": int,
        "duplicate_lines_percent": float,
        "duplicate_tokens_num": int,
        "duplicate_tokens_percent": float
    },
    ```

  - The script parses the log data and scrapes the summary table information into a dictionary:

    ```python
    {
        "language": str,
        "linter": str,
        "files": int or str,  # number of files detected in the given language by the linter
        "fixed": int,         # number of errors fixed automatically by the linter
        "errors": int         # number of errors that could not be fixed by the linter
    },
    ```

  - All available information is properly parsed and saved as an `output.json` file that contains a list of the previously mentioned dictionaries.
- New file: `CHANGELOG.md`.
  - This file serves as a diary of the progress of the programming side of this project.
- Added a new `README.md` entry about the new script file. It contains the information about the requirements needed in order to run the script, as well as the run process itself, with the expected output data.
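For illustration, a minimal scraper for a pipe-separated summary table might look like the sketch below. The row layout assumed here is simplified and does not reproduce the real Mega Linter log format:

```python
import re

# Hypothetical row layout (the actual Mega Linter log differs):
# | language | linter | files | fixed | errors |
ROW = re.compile(
    r"^\|\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|$"
)


def parse_summary(log_text: str) -> list:
    """Scrape summary-table rows out of a captured linter log."""
    rows = []
    for line in log_text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # skip non-table lines (headers, rules, prose)
        language, linter, files, fixed, errors = m.groups()
        rows.append({
            "language": language,
            "linter": linter,
            "files": int(files),
            "fixed": int(fixed),
            "errors": int(errors),
        })
    return rows
```

Each matched row becomes one of the dictionaries described above; the resulting list can then be dumped to `output.json`.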
## [0.2.3] - 2021-05-04 (Marcel Jerzyk)

- `README.md`:
  - Tutorial on how to install the Docker environment.
  - Tutorial on how to run Mega Linter locally on your own repository.
## [0.2.2] - 2021-04-26 (Jakub Litkowski)

- Fixed: `value is missing where true / false is required`.
- Fixed: `arguments suggest a different number of lines: 1, 5, 0`.
[0.2.1] - 2021-04-26 (Marcel Jerzyk)
README.md:- Added information & created
.giffiles for group M2 that should work as exhaustive instruction on how to use the R Studio.- Tutorial: How to generate own GitHub Token.
- Instruction: Installing R Studio.
- Navigating in R Studio.
- Added tutorial & created
.giffiles that contain exhaustive information on how to use Mega Linter through Github Actions.- Information about the CI file and what contents it should have (including snippet for easy copy-paste).
- Step by step actions in order to trigger the CI/CD Pipeline in GitHub.com on own repository with Mega Linter job.
- Added information & created
## [0.2.0] - 2021-04-20 (Jakub Litkowski)

- GraphQL query created.
- Scraped information from the query into variables:
  - Bio:

    ```r
    repositoriesNames <- json$data$repositoryOwner$repositories$edges$node$name
    bio <- json$data$repositoryOwner$bio
    isHireable <- json$data$repositoryOwner$isHireable
    emptyRepos <- json$data$repositoryOwner$repositories$edges$node$isEmpty
    commitMsgE <- json$data$repositoryOwner$repositories$edges$node$defaultBranchRef$target$history$edges
    ```

  - Commit messages:

    ```r
    commitMSGList <- list()
    commitDates <- list()
    for (d in commitMsgE) {
      for (i in 1:length(d$node$author$user$login)) {
        if (d$node$author$user$login[i] == "Luzkan") {
          commitDates <- c(commitDates, d$node$committedDate[i])
        }
      }
    }
    for (p in commitMsgE) {
      for (i in 1:length(p$node$author$user$login)) {
        if (p$node$author$user$login[i] == "Luzkan") {
          commitMSGList <- c(commitMSGList, p$node$message[i])
        }
      }
    }
    ```

  - Used languages:

    ```r
    languages <- json$data$repositoryOwner$repositories$edges$node$languages$edges
    languagesList <- list()
    for (l in languages) {
      print(l)
      languagesList <- c(languagesList, l$node$name)
    }
    ```

- Calculating the time between commits:

  ```r
  time_between_commits <- list()
  for (idx in seq_along(commitDates)) {
    if (idx + 1 > length(commitDates)) {
      break
    }
    dateOne <- as.POSIXct(commitDates[[idx]], format = "%Y-%m-%dT%H:%M:%SZ")
    dateTwo <- as.POSIXct(commitDates[[idx + 1]], format = "%Y-%m-%dT%H:%M:%SZ")
    time_between_commits[[idx]] <- as.numeric(difftime(dateOne, dateTwo, units = "mins"))
  }
  average_time_between_commit <- mean(unlist(time_between_commits))
  ```
## [0.1.0] - 2021-03-16 (Marcel Jerzyk)

- Created `README.md` for the project; it contains various useful pieces of information, requirements and instructions in order to run the program.
- Created the initial file structure.
- `github_graphql.r` file:
  - Added the imports that are required for GraphQL query creation:

    ```r
    library("ghql")
    library("jsonlite")
    library("dplyr")
    ```

  - GraphQL connection object:

    ```r
    # GraphQL Connection Object (GitHub)
    connection <- GraphqlClient$new(
      url = "https://api.github.com/graphql",
      headers = list(Authorization = paste0("Bearer ", token))
    )
    ```

  - Informative example GraphQL query:

    ```r
    new_query$query('mydata', '{
      repositoryOwner(login: "Luzkan") {
        repositories(first: 5, orderBy: {field: PUSHED_AT, direction: DESC}, isFork: false) {
          edges {
            node {
              name
              stargazers { totalCount }
            }
          }
        }
      }
    }')
    ```

  - Execution, parsing & writing to `.json` output:

    ```r
    # Execute Query
    (result <- connection$exec(new_query$queries$mydata))

    # Parse to a more human-readable form
    jsonlite::fromJSON(result)

    # Writing to file
    write(result, "output.json")
    ```
## [0.0.1] - 2021-03-01 (Lech Madeyski)

Project was initialized.