Grant PID2020-117041GA-I00 funded by:

FineDesc Learner Corpus: User Guide

How to search the FineDesc Learner Corpus

There are two options to search the FineDesc Learner Corpus, which can be selected in the ‘Search Term(s)’ box.

The ‘orthographic words’ option allows the user to search for a word or phrase (up to five consecutive words), which is typed in the ‘orthographic words’ box below.

The second search option is to conduct a ‘word proximity search’. The user can search for a word or phrase (up to five consecutive words) that occurs within one to ten words of another word or phrase (again up to five consecutive words). The search can be made bidirectional by clicking on the ‘bidirectional’ box.
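The word proximity search described above can be sketched in Python as follows. This is only an illustration and not the software's actual implementation: the function names, the tokenization rule and the way distance is counted (number of word positions between the end of the first phrase and the start of the second, so that adjacent words are at distance 1) are all assumptions.

```python
import re


def tokenize(text):
    # Assumed tokenization: lowercase words, apostrophes kept inside words.
    return re.findall(r"[\w']+", text.lower())


def find_phrase(tokens, phrase):
    # Start index of every occurrence of `phrase` (a list of words) in `tokens`.
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def proximity_search(text, first, second, max_distance, bidirectional=False):
    """Return (start of first, start of second) index pairs for every match.

    `second` must begin 1..max_distance words after `first` ends; with
    bidirectional=True, `second` may also end 1..max_distance words
    before `first` begins.
    """
    if not 1 <= max_distance <= 10:
        raise ValueError("distance must be between 1 and 10 words")
    tokens = tokenize(text)
    a, b = tokenize(first), tokenize(second)
    if not (1 <= len(a) <= 5 and 1 <= len(b) <= 5):
        raise ValueError("each search term must contain 1 to 5 words")

    hits = []
    for i in find_phrase(tokens, a):
        end_a = i + len(a) - 1
        for j in find_phrase(tokens, b):
            end_b = j + len(b) - 1
            gap_after = j - end_a    # second phrase follows the first
            gap_before = i - end_b   # second phrase precedes the first
            if 1 <= gap_after <= max_distance:
                hits.append((i, j))
            elif bidirectional and 1 <= gap_before <= max_distance:
                hits.append((i, j))
    return hits


sentence = "I would like to take part in the course because I like it"
print(proximity_search(sentence, "take part", "course", 3))
print(proximity_search(sentence, "course", "take part", 3, bidirectional=True))
```

In this sketch, a search for ‘course’ followed by ‘take part’ returns nothing unless the bidirectional option is set, which mirrors the effect of the ‘bidirectional’ box.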

Three wildcards may be used in the search.

Retrieving the results: output types

The software provides four output types, available in the ‘results’ section in a drop-down menu:

a) Simple frequency: the user will obtain the total number of occurrences of the word or phrase typed in the ‘orthographic words’ box or in the ‘word proximity search’ boxes, the normalized number of occurrences per 1,000,000 words, and the number of texts in which the word/phrase appears out of the total number of texts in the FineDesc Learner Corpus or in the selected subcorpus (see section ‘Filters’ below).

b) Full frequency: together with the information obtained in the ‘simple frequency’ output, the user will be provided with further information regarding the occurrence of the word or phrase typed considering the different compilation variables in the FineDesc Learner Corpus, which are available in the filters: CEFR level, gender, L1, English status, preparatory course, text formality, text type and communicative function. If no filters are selected, information regarding the complete FineDesc Learner Corpus is offered.

c) KWIC: this output option offers the user concordance lines in which the KWIC (keyword in context) is the word or phrase typed. The concordance lines can be ordered using the ‘sorting’ box. Its drop-down menu allows the user to sort the concordance lines by the words to the right (1R, 2R) or to the left (1L, 2L) of the KWIC, and by the different compilation variables in the FineDesc Learner Corpus: CEFR level, candidate’s gender, L1, English status, preparatory course, text formality, text type, communicative function, text topic and candidate number. The user can add as many sorting options as necessary by selecting them from the menu. The user can also set the number of concordance lines per page (50, 100 or 500) in the ‘page size’ drop-down menu. Case sensitivity can be enabled by clicking the ‘case’ box.

Once the concordance lines are shown on the screen (the KWIC is in bold), the user can view the word/phrase in a larger context than that of a concordance line, as well as further information regarding the variables of the text and the candidate who wrote it. Both are available by clicking on the example number to the left of the concordance line.

d) Texts: this option provides an exhaustive corpus breakdown. For each text in the FineDesc Learner Corpus, the user obtains the compilation variables (CEFR level, candidate’s gender, L1, English status, preparatory course, text formality, text type, communicative function, text topic), the number of the candidate who wrote the text and the number of words in the text. This option has been enabled to help the user select the subcorpus or subcorpora of the FineDesc Learner Corpus which best suit his/her research interests.
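To make output types a) and c) more concrete, the following Python sketch computes the three ‘simple frequency’ figures (raw occurrences, occurrences per 1,000,000 words, and number of texts containing the query) and builds concordance lines sorted by one position such as 1R or 1L. It is a minimal illustration under stated assumptions: the function names, the tokenization and the toy texts are hypothetical and are not taken from the software itself.

```python
import re


def tokenize(text):
    # Assumed tokenization: lowercase words, apostrophes kept inside words.
    return re.findall(r"[\w']+", text.lower())


def phrase_positions(tokens, phrase):
    # Start index of every occurrence of `phrase` in `tokens`.
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def simple_frequency(texts, query):
    phrase = tokenize(query)
    tokenized = [tokenize(t) for t in texts]
    total_words = sum(len(t) for t in tokenized)
    per_text = [len(phrase_positions(t, phrase)) for t in tokenized]
    raw = sum(per_text)
    return {
        "occurrences": raw,
        # Normalized frequency: raw count scaled to one million words.
        "per_million": raw / total_words * 1_000_000 if total_words else 0.0,
        "texts_with_hit": sum(1 for c in per_text if c > 0),
        "total_texts": len(texts),
    }


def kwic(texts, query, sort_by="1R"):
    """Return (left, keyword, right) token lists sorted by one position.

    `sort_by` is a digit plus a side, e.g. "1R" (first word to the right)
    or "2L" (second word to the left).
    """
    offset, side = int(sort_by[0]), sort_by[1].upper()
    if offset < 1 or side not in ("L", "R"):
        raise ValueError("sort_by must look like '1R' or '2L'")
    phrase = tokenize(query)
    n = len(phrase)
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i in phrase_positions(tokens, phrase):
            lines.append((tokens[:i], tokens[i:i + n], tokens[i + n:]))

    def key(line):
        left, _, right = line
        # For left-hand sorting, count positions outwards from the keyword.
        context = right if side == "R" else left[::-1]
        return context[offset - 1] if len(context) >= offset else ""

    return sorted(lines, key=key)


# Toy texts, invented for illustration only.
texts = [
    "I like the course",
    "The exam was hard",
    "I like the exam and the teacher",
]
print(simple_frequency(texts, "the"))
for left, keyword, right in kwic(texts, "the", sort_by="1R"):
    print(" ".join(left), "[" + " ".join(keyword) + "]", " ".join(right))
```

Because the toy corpus contains only 15 words, the per-million figure is very large; on a full corpus or subcorpus the same calculation yields values that can be compared across subcorpora of different sizes.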

Filters

The FineDesc Learner Corpus was compiled considering a number of variables regarding both the candidate and the text written. These variables can be found in the ‘filters’. Each filter has a drop-down menu where the different options per variable can be selected to analyse a subcorpus from the FineDesc Learner Corpus.

The variables regarding the candidate are: a) the candidate’s gender; b) the candidate’s L1; c) the status of English in the candidate’s plurilingual repertoire; and d) whether or not the candidate attended a preparatory course for the exam.

The variables regarding the text are: e) the CEFR level of the text; f) the text formality; g) the text type; h) the main communicative function(s) in the text; i) the text topic; j) the candidate number, assigned after an anonymization process which ensures that the number cannot be traced back to the candidate him/herself; and k) the number of texts each candidate contributed to the corpus. The CertAcles exam requires learners to write two texts. However, only the texts which two independent CEFR/CV expert raters evaluated as being at the specified CEFR level were compiled. As a result, each candidate may have contributed one or two texts to the learner corpus (if one or both texts were evaluated as being at the specified CEFR level) or no texts at all (if the raters considered that neither text was at the required CEFR level).

Retrieving the information from the FineDesc Learner Corpus

The user may download the results obtained in an Excel file (.xlsx) or in a .csv file.
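Once downloaded, a .csv file can be processed with any standard tool. The Python sketch below reads such a file with the standard csv module and counts rows per value of one column. The column names (‘cefr_level’, ‘hit’) and the sample rows are hypothetical placeholders, since the actual layout of the exported file is not described in this guide; they should be replaced with the headers found in the real download.

```python
import csv
import io
from collections import Counter


def load_rows(handle):
    # Each row becomes a dict keyed by the header line of the file.
    return list(csv.DictReader(handle))


def count_by(rows, column):
    # Number of rows for each distinct value in `column`.
    return Counter(row[column] for row in rows)


# Invented sample standing in for a downloaded file.
sample = io.StringIO("cefr_level,hit\nB1,because\nB2,because\nB1,because\n")
rows = load_rows(sample)
print(count_by(rows, "cefr_level"))

# With a real download (file name is hypothetical):
# with open("finedesc_results.csv", newline="", encoding="utf-8") as f:
#     rows = load_rows(f)
```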

Special characters

The output may contain occurrences of ‘XXX’. This placeholder was used during the anonymization process to hide proper names, surnames, or any other type of sensitive information.