Introduction

If you are reading this, you probably come from the SaCoCo tutorial, you have access to a CQPweb installation as administrator, and you want to encode the corpus. We document below two approaches:

  1. advanced
  2. “easy”

The first approach is more involved, but allows for much more control and freedom. The second might be helpful, specially if you are a beginner, and the annotation of your corpus is pretty basic.

Advanced

Let’s assume that you have:

  • cqp installed in your computer
  • administrator access to a CQPweb installation
  • root access to the server where the CQPweb lives

Encode the corpus for the CWB

The first thing we need to do is to encode the corpus. This process will create a number of files that will enable to use the CQP language to query the corpus.

Once we have the texts in VRT format, encoding the corpus for the CWB is relatively easy.

Check that you have the corpus work bench installed in the computer, if not, download it and follow these instructions. We compiled from source version 3.4.8.

Now, run the following commands:

# create the target folder for encoded data
mkdir -p data/cqp/data
# run the command
cwb-encode -c utf8 -d data/cqp/data -F data/contemporary/meta/ -F data/historical/meta -R data/cqp/sacoco -xsB -S text:0+id+collection+source+year+decade+period+title -S p:0 -S s:0 -P pos -P lemma -P norm
# generate the registry file
cwb-make -r data/cqp -V SACOCO

The cwb-encode’s parameters explained:

  • -c to the declare the character encoding
  • -d path to the target directory were the output will be stored
  • -F path to the input directory were the VRT files are located
  • -R path to the registry file
  • -xsB
    • x for XML compatibility mode (recognises default entities and skips comments as well as an XML declaration)
    • s to skip blank lines in the input
    • B to strip white spaces from tokens
  • -S to declare a structural attribute, example:
    • -S text:0+id+authors/
    • text, structural attribute to be declared
    • 0 embedding levels
    • id will be an attribute of text containing some value
  • -P to declare positional attributes

Get extensive information on how to encode corpora for the CWB in the encoding tutorial.

TIP: for development/testing purposes, just run the command below on the test files.

# create the target folder for encoded data
mkdir -p test/cqp/data
# run the command
cwb-encode -c utf8 -d test/cqp/data -F test/contemporary/meta/ -F test/historical/meta -R test/cqp/sacoco -xsB -S text:0+id+collection+source+year+decade+period+title -S p:0 -S s:0 -P pos -P lemma -P norm
# generate the registry file
cwb-make -r test/cqp -V SACOCO

Upload the corpus to the server and set permissions

Once you have the data you have to upload the file to the server where CQPweb is installed. In our case is the machine fedora.clarin-d.uni-saarland.de.

In our case, one needs to connect to the server as root user. There are different methods to upload the files:

  • via the command line with tools like scp or rsync which use the ssh protocol
  • via a FTP client like Filezilla

Upload the local folder data/cqp/sacoco/ to the remote folder (in the server) /data2/cqpweb/indexed, and the registry file data/cqp/sacoco to the folder /data2/cqpweb/registry.

Once all files are uploaded, you have to check the ownership of the folder/file:

  • the owner should be wwwrun
  • the group should be www

If not just run a couple of commands:

chown -R wwwrun:www /data2/cqpweb/indexed/sacoco
chown wwwrun:www /data2/cqpweb/registry/sacoco

Then, modify the registry file /data2/cqpweb/registry/sacoco to indicate the location of the corpus in the server /data2/cqpweb/indexed/sacoco.

Log in as admin in CQPweb

  1. Type the URL to your CQPweb installation (e.g. https://fedora.clarin-d.uni-saarland.de/cqpweb/)
  2. log in with an administrator account, you are redirected to your user account
  3. click on Go to admin control panel in the left-hand menu Account actions.

Installing the corpus

We can now start installing the corpus:

  1. click on Install a new corpus in the left menu Corpora
  2. click on the link Click here to install a corpus you have already indexed in CWB. which you will find in the grey row at the top of the page.
  3. Fill in the fields
    1. Specify a MySQL name for this corpus: sacoco
    2. Enter the full name of the corpus: Saarbrücken Cookbook Corpus
    3. Specify the CWB name (lowercase format): sacoco
  4. Click on the button Install corpus with settings above that you will find at the bottom of the page.

A new page will load:

  1. click on Design and insert a text-metadata table for the corpus

A new page will load:

  1. Choose sacoco.meta in section Choose the file containing the metadata
  2. Fill in the field rows in Describe the contents of the file you have selected, providing for Handle and Description:
    1. year
    2. decade
    3. period
    4. collection
    5. source
    6. title
  3. Mark collection as the primary category.
  4. Set title as free text
  5. Select Yes please in section Do you want to automatically run frequency-list setup?
  6. Finally, click on the button install metadata table using the settings above

Now set up the annotation (positional attributes):

  1. click on Manage annotation, you will find it in the left menu, in section Admin Tools.
  2. complete the annotation metadata information at the bottom:
    1. lemma: Description: lemma
    2. click on Go!
    3. pos: Description: pos; Tagset name: STTS; External URL: http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html
    4. click on Go!
    5. norm: Description norm; External URL: http://www.deutschestextarchiv.de/doku/software#cab
    6. click on Go!
    7. set pos as Primary annotation above
    8. click on Update annotation settings.

Check corpus settings:

  1. go to Corpus settings in Admin tools
  2. in General options:
    1. assign a category in field The corpus is currently in the following category: Historical corpora
    2. click on the Update button
    3. provide an external URL: http://hdl.handle.net/11858/00-246C-0000-001F-7C43-1
    4. click on the Update button

We set the access to this corpus open for everybody:

  1. go to Admin Control Panel in Admin tools
  2. go to Manage privileges in Users and privileges
  3. scroll to the bottom of the page, there
    1. select sacoco from list Generate default privileges for corpus...
    2. click on button Generate default privileges for this corpus.
  4. go to Manage group grants in Users and privileges
    1. scroll to the bottom, in section Grant new privilege to group
    2. Select group everybody
    3. Select a privilege Normal access privilege for corpus [sacoco]
    4. click on Grant privilege to group!

Hurraaaaah! Corpus ready to be queried!

“Easy”

Let’s assume that you have administrator access to a CQPweb installation. We will guide you in the following lines through the process of setting up the corpus.

Concatenate all texts in a single corpus VRT file

We need a single XML file containing all texts. texts2corpus.py helps us to ease the task.

texts2corpus.py:

  • finds all .vrt files contained in the input folders
  • gets the <text> nodes
  • appends the <text> nodes to a parent element called <corpus>
  • saves <corpus> as a single XML file

Its usage is pretty simple, just provide the path to the folders containing the .vrt files with metadata, and the path to the output folder:

python3 texts2corpus.py -i data/contemporary/meta data/historical/meta -o data/sacoco.vrt

TIP: for development/testing purposes, if you just run python3 texts2corpus.py, it will work on the testing dataset stored in the test folder. ```

Log in as admin

  1. Type the URL to your CQPweb installation (e.g. https://fedora.clarin-d.uni-saarland.de/cqpweb/)
  2. log in with an administrator account, you are redirected to your user account
  3. click on Go to admin control panel in the left-hand menu Account actions.

Upload files

We need to upload the corpus file (sacoco.vrt) and the metadata file (sacoco.meta).

For each file:

  1. click on Upload a file in the left menu Uploads.
  2. click on Choose File, a dialogue window will open, pick the file you want to upload.
  3. click on Upload File.

Installing the corpus

We can now start installing the corpus:

  1. click on Install a new corpus in the left menu Corpora
  2. in install new corpus section
    1. provide a MySQL name for the corpus: sacoco
    2. provide a name for the corpus: sacoco
    3. provide the full name of the corpus: Saarbrücken Cookbook Corpus
  3. in Select files section
    1. select sacoco.vrt
  4. in S-attributes section
    1. check the option on the left Use custom setup
    2. and then add in the boxes on the right:
      1. p:0
  5. in P-attributes section
    1. check the option on the left Use custom setup
    2. set the first row as Primary
    3. add the following three positional attributes, each value is a field:
      1. Handle: pos; Description: Part-Of-Speech; Tagset: STTS; External URL: http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html
      2. Handle: lemma; Description: lemma; Tagset: leave it empty; External URL: leave it empty.
      3. Handle: norm; Description: orthographic correction by CAB, Tagset: leave it empty; External URL: http://www.deutschestextarchiv.de/doku/software#cab
  6. click on Install corpus with settings above at the bottom of the page.

A new page will load:

  1. click on Design and insert a text-metadata table for the corpus

A new page will load:

  1. Choose sacoco.meta in section Choose the file containing the metadata
  2. Fill in the field rows in Describe the contents of the file you have selected, providing for Handle and Description:
    1. year
    2. decade
    3. period
    4. collection
    5. source
  3. Select Yes please in section Do you want to automatically run frequency-list setup?
  4. Finally, click on the button install metadata table using the settings above

Admin tools

  • Corpus settings: probably nothing to do here; has been set during the installation process
  • Manage access: to add user groups for your corpus (otherwise only the superuser can access the corpus!)
  • Manage metadata: probably nothing to do here; has been set during the installation process
  • Manage text categories: here you can add more “speaking” descriptions for your text categories
  • Manage annotation: add descriptions / URLs of documentations for your positional attributes; specify primary/secondary/… annotations for the CQP Simple Query language; specifying annotations here makes them available for restrictions throughout CQPweb (e.g. for the collocation function)
  • Manage privileges: Scroll to end of page and generate default privileges for the corpus; select than “Manage group grants”, scroll to end of page and select a group and grant it privileges of that particular corpus (normally normal privileges are chosen)