# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [1.0.2] - 2021-06-20 (Marcel Jerzyk)

- Added the research paper "Evaluating the Potential of a Candidate for a Job Offer Based on his GitHub Profile" in `.pdf` format (here).
- Updated `README.md` with information about the `ResearchPaper.pdf`.
- Removed the old `LANG.md` file from `./docs`.
## [1.0.1] - 2021-06-20 (Marcel Jerzyk)

- Added one more idea to the Further Research part of the `conclusions.tex` section.
- Added the Report #2 file (`report_2_for_group_m2.tex`) to the `archive` directory.
- Added missing paper images to the repository.
## [1.0.0] - 2021-06-20 (Marcel Jerzyk)

Final release.

- Applied some final formatting to the paper, like page breaks where they make the article look better.
- Added self-defined formatting for code listings so they look better and highlight keywords, comments and strings properly.
- Updated `abstract.tex` with the results.
- Added the final sample size used in the algorithms (`creating_ml_model.tex`).
- Added an explanation of what the R script does to `data_collection.tex`.
- Removed an unused reference.
- Moved older versions of sections to the `paper/misc/archive` directory.
## [0.11.2] - 2021-06-17 (Marcel Jerzyk)

- For privacy, removed the GitHub links and usernames from data files.
## [0.11.1] - 2021-06-17 (Marcel Jerzyk)

- Changed file names to be less ambiguous.
- Rebased; added changelog entries.
- Removed the empty `reproduction` directory.
- Removed the token from `repo_data.r`.
## [0.11.0] - 2021-06-17 (Jakub Litkowski)

- Added a new `extract_data_from_git.r` script.
- Added a new `cleaned_data_with_ql.csv` file.
## [0.10.1] - 2021-06-17 (Marcel Jerzyk)

- Moved `model_script.r` from `./reproduction` into the `./src/gitprofiler/r_scripts/` directory.
- Moved & renamed `model_data_no_labels.csv` from `./reproduction` into the `./data/` directory.
- Rebased, added changelog entries, and tweaked & updated `./README.md` regarding reproduction.
## [0.10.0] - 2021-06-17 (Jakub Szałca)

- Added `model_script.R` for reproduction purposes.
- Added `modelDataNoLabels.csv` for reproduction purposes.
## [0.9.0] - 2021-06-17 (Marcel Jerzyk)

- `conclusions.tex` section written & translated.
- `discussion.tex` section written & translated.
- `results.tex` section written & translated.
- Conclusion written in the `abstract.tex` section.
- Added `url` formatting in the paper.
- Cropped the image in the systematic review section and adjusted its width.
## [0.8.1b] - 2021-06-15 (Marcel Jerzyk & Jakub Litkowski)

- Updated `README.md` with reproduction instructions for groups `M2`, `M3` & `M4`.
## [0.8.1a] - 2021-06-13 (Marcel Jerzyk)

- Added `pandas`, `numpy`, `requests` and `gitpython` as required packages in `requirements.txt`.
## [0.8.1] - 2021-06-13 (Marcel Jerzyk)

- In `README.md`: added information about the paper directory.
- Moved `/literature/` under the `./data` directory.
- Fixed spelling in `CHANGELOG.md`.
## [0.8.0] - 2021-06-13 (Marcel Jerzyk)

Further work on the paper:

- Created the Creating Machine Learning Model section.
- First results are available in the Results section.
- Re-checked the whole paper via a grammar checker and adjusted it accordingly.
- Gave the whole paper a "lifting" via improved formatting.
- Added labels to sections so they can be cross-referenced, for an improved reading experience.
- Improved the language and added content in the Data Processing section,
  - among others, details regarding the manual pre-processing of the questionnaire.
- Added a Data section in `README.md`.
## [0.7.2] - 2021-06-13 (Marcel Jerzyk)

- Added the used literature as `.pdf` files in the `./data/literature/` directory.
## [0.7.1] - 2021-06-04 (Marcel Jerzyk)

- Retranslated the Literature Review into formal "paper" English.
## [0.7.0] - 2021-06-02 (Marcel Jerzyk)

- New script: `.../merge_jsons.py` was added to the project.
  - This script takes on input: `str: github_username`.
  - Output is stored (by default) in:
  - The purpose of this script is to merge all of the `.json` files that are generated via the `.../scan_repositories.py` script into one `.json` file that contains the merged information from all of them, in a "gather" mode. This means that when information fetched for a given language-linter combo is present in some other file, the results are summed up. This allows even easier usage of the Mega Linter information in the ML model, because the processing work is already done.
  - The script also takes care of the different kinds of `.json`s, as some of them are linter-related while others contain aggregated results.
  - The structure of the file is as follows:

    ```json
    {
      "<lang>": {
        "<linter>": { "errors": int, "files": int, "fixed": int },
        "total": {
          "clones": int,
          "duplicate_lines_num": int,
          "duplicate_tokens_num": int,
          "files": int,
          "lines": int,
          "tokens": int
        }
      }
    }
    ```

    - There is a special case for `<lang>`, which is `"Total:"`; it contains all the aggregated results.
    - The `"total"` subkey doesn't always have to exist; for example, it does not for `<lang>`: `cspell`, `xml`, `yaml`.
  - Example final output:

    ```json
    {
      "Total:": {
        "total": {
          "clones": 163,
          "duplicate_lines_num": 6061,
          "duplicate_tokens_num": 52202,
          "files": 428,
          "lines": 44590,
          "tokens": 473152
        }
      },
      "java": {
        "checkstyle": { "errors": 108, "files": 109, "fixed": 0 },
        "total": {
          "clones": 56,
          "duplicate_lines_num": 948,
          "duplicate_tokens_num": 10656,
          "files": 106,
          "lines": 11093,
          "tokens": 106282
        }
      },
      "python": {
        "black": { "errors": 59, "files": 62, "fixed": 0 },
        "flake8": { "errors": 2438, "files": 62, "fixed": 0 },
        "isort": { "errors": 37, "files": 62, "fixed": 0 },
        "pylint": { "errors": 2, "files": 62, "fixed": 0 },
        "total": {
          "clones": 40,
          "duplicate_lines_num": 798,
          "duplicate_tokens_num": 8403,
          "files": 58,
          "lines": 6111,
          "tokens": 57864
        }
      },
      "spell": {
        "cspell": { "errors": 9555, "files": 640, "fixed": 0 },
        "misspell": { "errors": 18, "files": 640, "fixed": 0 }
      }
    }
    ```
- Added the `paper` directory.
  - It contains the LaTeX paper, which got a complete overhaul:
    - Redundant text and formatting were removed.
    - Pointless comments were removed and the existing comments were standardized.
    - Sections of the paper were moved to separate files.
    - Heading/title sections were adjusted.
    - A few errors and warnings were resolved.
    - Created the `/img/` directory for images.
    - Created the `/misc/` folder for all the material that is not tightly related to the paper or its structure.
- The `.../scan_repositories.py` script now fetches the repositories of a given user automatically (no need to provide a separate repositories list in order to fetch them).
  - New input: `str: github_username`.
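The "gather" merge that `merge_jsons.py` performs can be sketched as a recursive dictionary merge that sums numeric leaves. This is an illustrative reimplementation under assumed names (`gather`, `merge_json_files`), not the script's actual code:

```python
import json
from pathlib import Path


def gather(a: dict, b: dict) -> dict:
    """Recursively merge `b` into `a`, summing numeric leaves.

    This mirrors the "gather" mode described above: when the same
    language-linter combo appears in both inputs, the counts
    (errors, files, lines, ...) are added together.
    """
    out = dict(a)
    for key, value in b.items():
        if key in out and isinstance(out[key], dict) and isinstance(value, dict):
            out[key] = gather(out[key], value)
        elif key in out and isinstance(out[key], (int, float)):
            out[key] = out[key] + value
        else:
            out[key] = value
    return out


def merge_json_files(paths) -> dict:
    """Fold every per-repository `.json` file into one merged dict."""
    merged: dict = {}
    for path in paths:
        merged = gather(merged, json.loads(Path(path).read_text()))
    return merged
```

With the example output above, merging two such files would, for instance, add their `flake8` error counts under the `"python"` key.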
## [0.6.0] - 2021-05-25 (Marcel Jerzyk)

- New script: `.../scan_repositories.py` was added to the project.
  - This script takes on input: `str: github username`, `list[str]: github repositories names`.
  - It automates the repository cloning, Mega Linter linting and Mega Linter scraper scraping routine. It does exactly what intuition might suggest:
    - first it runs `git clone` for every single repository name given on input,
    - then it uses Mega Linter to parse the repositories' contents and generate an output file,
    - and at the end it uses the previously made `.../scraper.py` to parse the file contents into a machine-readable `.json` format.
  - The output results are stored in `/data/repositories/<username>` *(the directory will be automatically created if it's not present yet)*.
  - The script will also try to clean the repositories directory of directories generated by earlier launches.
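The clone-and-lint routine above could be sketched roughly as follows. The repository URL scheme, the output-file naming, and the `npx mega-linter-runner` invocation are assumptions for illustration, not the project's actual code:

```python
import subprocess
from pathlib import Path


def scan_repositories(username: str, repo_names: list,
                      out_dir: str = "data/repositories") -> Path:
    """Clone each repository and capture a linter log for it (sketch)."""
    target = Path(out_dir) / username
    # Create /data/repositories/<username> if it is not present yet.
    target.mkdir(parents=True, exist_ok=True)
    for name in repo_names:
        repo_dir = target / name
        # 1. Clone the repository (URL scheme assumed here).
        subprocess.run(
            ["git", "clone",
             f"https://github.com/{username}/{name}.git", str(repo_dir)],
            check=True,
        )
        # 2. Run Mega Linter in the repository directory and capture its
        #    stdout, mimicking the `> output.txt` redirection; the exact
        #    CLI invocation is an assumption.
        with open(target / f"{name}_output.txt", "w") as log:
            subprocess.run(["npx", "mega-linter-runner"],
                           cwd=repo_dir, stdout=log, check=False)
    return target
```

The `.../scraper.py` step is omitted in this sketch; it would be run on each captured log afterwards.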
## [0.5.0] - 2021-05-11 (Marcel Jerzyk)

- New directory: `./docs`, containing various files regarding the technical side of the project, as well as images used in markdown files.
- LaTeX document changes tracker: `LANGv1.md`. It has previous versions of sections and subsections:
  - Systematic Review
  - Research Questions
  - Resources to Be Searched
  - Results Selection Process
  - Data Collection
  - Data Pre-processing
- Moved directory: `./img/readme/*` inside `./docs` (now: `./docs/img/readme/*`).
## [0.4.0] - 2021-05-11 (Marcel Jerzyk)

This changelog entry will be filled in a few days.
## [0.3.1] - 2021-05-08 (Marcel Jerzyk)

- Logged in this file the Version History for `0.3.1`.
- Logged in this file the Version History for `0.3.0`.
- Logged in this file the Version History for `0.2.1`.
- Logged in this file the Version History for `0.2.0`.
- Logged in this file the Version History for `0.1.0`.
- Logged in this file the Version History for `0.0.1`.
- Updated the Version Number in `README.md`.
## [0.3.0] - 2021-05-08 (Marcel Jerzyk)

- New script: `scrape.py`.
  - The script takes as input an output file, which can be generated via Mega Linter by redirecting the standard output stream into a text file (`> output.txt`).
  - The script parses the log data and scrapes the duplicates table information into a dictionary:

    ```python
    {
        "language": str,
        "files": int,    # number of files detected in the given language by the linter
        "lines": int,    # number of lines detected in the given language
        "tokens": int,   # number of tokens ("chars") detected in the given language
        "clones": int,
        "duplicate_lines_num": int,
        "duplicate_lines_percent": float,
        "duplicate_tokens_num": int,
        "duplicate_tokens_percent": float
    },
    ```

  - The script parses the log data and scrapes the summary table information into a dictionary:

    ```python
    {
        "language": str,
        "linter": str,
        "files": int or str,  # number of files detected in the given language by the linter
        "fixed": int,         # number of errors fixed automatically by the linter
        "errors": int         # number of errors that could not be fixed by the linter
    },
    ```

  - All available information is properly parsed and saved as an `output.json` file that contains a list of the previously mentioned dictionaries.
- New file: `CHANGELOG.md`.
  - This file serves as a diary of the progress of the programming side of this project.
- Added a new `README.md` entry about the new script file. It contains the information about the requirements needed in order to run the script, as well as the run process itself, with the expected output data.
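For illustration, a minimal scraper for a pipe-separated summary table might look like the sketch below. The row layout assumed here is simplified and does not reproduce the real Mega Linter log format:

```python
import re

# Hypothetical row layout (the actual Mega Linter log differs):
# | language | linter | files | fixed | errors |
ROW = re.compile(
    r"^\|\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|$"
)


def parse_summary(log_text: str) -> list:
    """Scrape summary-table rows out of a captured linter log."""
    rows = []
    for line in log_text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # skip non-table lines (headers, rules, prose)
        language, linter, files, fixed, errors = m.groups()
        rows.append({
            "language": language,
            "linter": linter,
            "files": int(files),
            "fixed": int(fixed),
            "errors": int(errors),
        })
    return rows
```

Each matched row becomes one of the dictionaries described above; the resulting list can then be dumped to `output.json`.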
## [0.2.3] - 2021-05-04 (Marcel Jerzyk)

- `README.md`:
  - Tutorial on how to install the Docker environment.
  - Tutorial on how to run Mega Linter locally on your own repository.
## [0.2.2] - 2021-04-26 (Jakub Litkowski)

- Fixed: `value is missing where true / false is required`.
- Fixed: `arguments suggest a different number of lines: 1, 5, 0`.
[0.2.1] - 2021-04-26 (Marcel Jerzyk)
README.md:- Added information & created
.giffiles for group M2 that should work as exhaustive instruction on how to use the R Studio.- Tutorial: How to generate own GitHub Token.
- Instruction: Installing R Studio.
- Navigating in R Studio.
- Added tutorial & created
.giffiles that contain exhaustive information on how to use Mega Linter through Github Actions.- Information about the CI file and what contents it should have (including snippet for easy copy-paste).
- Step by step actions in order to trigger the CI/CD Pipeline in GitHub.com on own repository with Mega Linter job.
- Added information & created
## [0.2.0] - 2021-04-20 (Jakub Litkowski)

- GraphQL query created.
- Scraped information from the query into variables:
  - Bio:

    ```r
    repositoriesNames <- json$data$repositoryOwner$repositories$edges$node$name
    bio <- json$data$repositoryOwner$bio
    isHireable <- json$data$repositoryOwner$isHireable
    emptyRepos <- json$data$repositoryOwner$repositories$edges$node$isEmpty
    commitMsgE <- json$data$repositoryOwner$repositories$edges$node$defaultBranchRef$target$history$edges
    ```

  - Commit messages:

    ```r
    commitMSGList <- list()
    commitDates <- list()
    for (d in commitMsgE) {
      for (i in 1:length(d$node$author$user$login)) {
        if (d$node$author$user$login[i] == "Luzkan") {
          commitDates <- c(commitDates, d$node$committedDate[i])
        }
      }
    }
    for (p in commitMsgE) {
      for (i in 1:length(p$node$author$user$login)) {
        if (p$node$author$user$login[i] == "Luzkan") {
          commitMSGList <- c(commitMSGList, p$node$message[i])
        }
      }
    }
    ```

  - Used languages:

    ```r
    languages <- json$data$repositoryOwner$repositories$edges$node$languages$edges
    languagesList <- list()
    for (l in languages) {
      print(l)
      languagesList <- c(languagesList, l$node$name)
    }
    ```

- Calculating the time between commits:

  ```r
  time_between_commits <- list()
  for (idx in seq_along(commitDates)) {
    if (idx + 1 > length(commitDates)) {
      break
    }
    dateOne <- as.POSIXct(commitDates[[idx]], format = "%Y-%m-%dT%H:%M:%SZ")
    dateTwo <- as.POSIXct(commitDates[[idx + 1]], format = "%Y-%m-%dT%H:%M:%SZ")
    time_between_commits[[idx]] <- as.numeric(difftime(dateOne, dateTwo, units = "mins"))
  }
  average_time_between_commit <- mean(unlist(time_between_commits))
  ```
## [0.1.0] - 2021-03-16 (Marcel Jerzyk)

- Created `README.md` for the project; it contains various useful pieces of information, requirements and instructions in order to run the program.
- Created the initial file structure.
- `github_graphql.r` file:
  - Added the imports that are required for GraphQL query creation:

    ```r
    library("ghql")
    library("jsonlite")
    library("dplyr")
    ```

  - GraphQL connection object:

    ```r
    # GraphQL Connection Object (GitHub)
    connection <- GraphqlClient$new(
      url = "https://api.github.com/graphql",
      headers = list(Authorization = paste0("Bearer ", token))
    )
    ```

  - Informative example GraphQL query:

    ```r
    new_query$query('mydata', '{
      repositoryOwner(login: "Luzkan") {
        repositories(first: 5, orderBy: {field: PUSHED_AT, direction: DESC}, isFork: false) {
          edges {
            node {
              name
              stargazers { totalCount }
            }
          }
        }
      }
    }')
    ```

  - Execution, parsing & writing to `.json` output:

    ```r
    # Execute Query
    (result <- connection$exec(new_query$queries$mydata))

    # Parse to a more human-readable form
    jsonlite::fromJSON(result)

    # Writing to file
    write(result, "output.json")
    ```
## [0.0.1] - 2021-03-01 (Lech Madeyski)

Project was initialized.