If you are reading this, you probably come from the SaCoCo tutorial, you have access to a CQPweb installation as administrator, and you want to encode the corpus. We document below two approaches:
The first approach is more involved, but allows for much more control and freedom. The second might be helpful, especially if you are a beginner and the annotation of your corpus is fairly basic.
Let’s assume that you have:
cqp installed on your computer
The first thing we need to do is to encode the corpus. This process will create a number of files that enable us to use the CQP language to query the corpus.
Once we have the texts in VRT format, encoding the corpus for the CWB is relatively easy.
Check that the Corpus Workbench (CWB) is installed on your computer; if not, download it and follow these instructions. We compiled version 3.4.8 from source.
Now, run the following commands:
# create the target folder for encoded data
mkdir -p data/cqp/data
# run the command
cwb-encode -c utf8 -d data/cqp/data -F data/contemporary/meta/ -F data/historical/meta -R data/cqp/sacoco -xsB -S text:0+id+collection+source+year+decade+period+title -S p:0 -S s:0 -P pos -P lemma -P norm
# index and compress the encoded corpus (the registry file was already written by cwb-encode -R)
cwb-make -r data/cqp -V SACOCO
The cwb-encode parameters explained:
-c to declare the character encoding
-d path to the target directory where the output will be stored
-F path to the input directory where the VRT files are located
-R path to the registry file
-xsB
    x for XML compatibility mode (recognises default entities and skips comments as well as an XML declaration)
    s to skip blank lines in the input
    B to strip white spaces from tokens
-S to declare a structural attribute, example: -S text:0+id+authors/
    text, structural attribute to be declared
    0 embedding levels
    id will be an attribute of text containing some value
-P to declare positional attributes
Get extensive information on how to encode corpora for the CWB in the encoding tutorial.
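To see how the flags fit together, here is a minimal Python sketch that assembles the same cwb-encode call from its logical parts. The helper name build_cwb_encode_args is our own invention for illustration; it only builds the argument list and does not run anything.

```python
import shlex


def build_cwb_encode_args(data_dir, input_dirs, registry_file,
                          s_attrs, p_attrs, charset="utf8"):
    """Assemble the cwb-encode argument list (hypothetical helper)."""
    args = ["cwb-encode", "-c", charset, "-d", data_dir]
    for folder in input_dirs:           # one -F per folder of VRT files
        args += ["-F", folder]
    args += ["-R", registry_file, "-xsB"]
    for name, annotations in s_attrs:   # e.g. ("text", ["id", "year"])
        spec = f"{name}:0"              # 0 embedding levels
        if annotations:
            spec += "+" + "+".join(annotations)
        args += ["-S", spec]
    for name in p_attrs:                # one -P per positional attribute
        args += ["-P", name]
    return args


args = build_cwb_encode_args(
    data_dir="data/cqp/data",
    input_dirs=["data/contemporary/meta/", "data/historical/meta"],
    registry_file="data/cqp/sacoco",
    s_attrs=[
        ("text", ["id", "collection", "source", "year",
                  "decade", "period", "title"]),
        ("p", []),
        ("s", []),
    ],
    p_attrs=["pos", "lemma", "norm"],
)
print(shlex.join(args))
```

The printed line should match the cwb-encode command used in this tutorial. If you want to execute it from Python, the list can be passed to subprocess.run(args, check=True).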
TIP: for development/testing purposes, just run the command below on the test files.
# create the target folder for encoded data
mkdir -p test/cqp/data
# run the command
cwb-encode -c utf8 -d test/cqp/data -F test/contemporary/meta/ -F test/historical/meta -R test/cqp/sacoco -xsB -S text:0+id+collection+source+year+decade+period+title -S p:0 -S s:0 -P pos -P lemma -P norm
# index and compress the encoded corpus (the registry file was already written by cwb-encode -R)
cwb-make -r test/cqp -V SACOCO
Once you have the data, you have to upload the files to the server where CQPweb is installed. In our case, this is the machine fedora.clarin-d.uni-saarland.de.
In our case, one needs to connect to the server as the root user. There are different methods to upload the files:
scp or rsync, which use the ssh protocol
Upload the local data folder data/cqp/data (the directory passed to cwb-encode with -d) to the server so that it ends up as /data2/cqpweb/indexed/sacoco, and the registry file data/cqp/sacoco to the folder /data2/cqpweb/registry.
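As a rough illustration of the upload step, the Python sketch below builds one rsync and one scp command. It assumes that the encoded data live in the directory passed to cwb-encode with -d (data/cqp/data) and that they should end up in a folder named after the corpus on the server. The helper name upload_commands and its defaults are ours; adapt host, user and paths to your own installation.

```python
import shlex


def upload_commands(local_data_dir, local_registry_file, corpus,
                    host="fedora.clarin-d.uni-saarland.de", user="root",
                    remote_indexed="/data2/cqpweb/indexed",
                    remote_registry="/data2/cqpweb/registry"):
    """Return the two upload commands as argument lists (hypothetical helper)."""
    remote = f"{user}@{host}"
    # A trailing slash tells rsync to copy the folder's contents, not the folder itself.
    data_src = local_data_dir.rstrip("/") + "/"
    return [
        ["rsync", "-a", data_src, f"{remote}:{remote_indexed}/{corpus}/"],
        ["scp", local_registry_file, f"{remote}:{remote_registry}/{corpus}"],
    ]


commands = upload_commands("data/cqp/data", "data/cqp/sacoco", "sacoco")
for command in commands:
    print(shlex.join(command))
```

Nothing is transferred by this snippet; it only prints the commands so that you can inspect them before running them in a terminal.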
Once all files are uploaded, check the ownership of the folder and the file. Both should belong to:
user: wwwrun
group: www
If they do not, run these two commands:
chown -R wwwrun:www /data2/cqpweb/indexed/sacoco
chown wwwrun:www /data2/cqpweb/registry/sacoco
Then, modify the registry file /data2/cqpweb/registry/sacoco to indicate the location of the corpus on the server: /data2/cqpweb/indexed/sacoco.
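The registry edit can be done by hand in any text editor. If you prefer to script it, the following Python sketch shows one possible way. It assumes the usual layout of a CWB registry file, where a HOME line holds the data directory and an INFO line points to a file inside it, and that the paths contain no spaces or quotes. The function name relocate_registry is our own.

```python
import re


def relocate_registry(registry_text, new_home):
    """Point the HOME and INFO lines of a registry file to a new data directory."""
    match = re.search(r"^HOME\s+(\S+)\s*$", registry_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no HOME line found in the registry text")
    old_home = match.group(1).rstrip("/")
    new_home = new_home.rstrip("/")
    lines = []
    for line in registry_text.splitlines():
        parts = line.split(None, 1)
        if (len(parts) == 2 and parts[0] in ("HOME", "INFO")
                and parts[1].strip().startswith(old_home)):
            # Keep whatever follows the old data directory (e.g. "/.info").
            rest = parts[1].strip()[len(old_home):]
            line = f"{parts[0]} {new_home}{rest}"
        lines.append(line)
    return "\n".join(lines) + "\n"


sample = (
    'NAME ""\n'
    "ID   sacoco\n"
    "HOME data/cqp/data\n"
    "INFO data/cqp/data/.info\n"
)
print(relocate_registry(sample, "/data2/cqpweb/indexed/sacoco"))
```

To apply it on the server, read /data2/cqpweb/registry/sacoco, pass its content through the function, and write the result back. Keep a backup of the original file first.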
Go to admin control panel in the left-hand menu Account actions.
We can now start installing the corpus:
Click on "Install a new corpus" in the left menu Corpora.
Follow the link "Click here to install a corpus you have already indexed in CWB.", which you will find in the grey row at the top of the page.
Enter the following values in the form:
    sacoco
    Saarbrücken Cookbook Corpus
    sacoco
Click on "Install corpus with settings above", which you will find at the bottom of the page.
A new page will load:
Click on "Design and insert a text-metadata table for the corpus".
A new page will load:
Select sacoco.meta in section "Choose the file containing the metadata".
Fill in "Describe the contents of the file you have selected", providing for Handle and Description:
    collection as the primary category
    title as free text
Select "Yes please" in section "Do you want to automatically run frequency-list setup?".
Click on "install metadata table using the settings above".
Now set up the annotation (positional attributes):
Go to "Manage annotation"; you will find it in the left menu, in section Admin Tools.
Go!
Go!
Go!
Select pos as "Primary annotation" above.
Click on "Update annotation settings".
Check corpus settings:
Go to "Corpus settings" in Admin tools.
General options:
    "The corpus is currently in the following category:" Historical corpora
    Update button
    Update button
We make this corpus accessible to everybody:
Go to "Admin Control Panel" in Admin tools.
Go to "Manage privileges" in Users and privileges.
Select sacoco from the list "Generate default privileges for corpus...".
Click on "Generate default privileges for this corpus".
Go to "Manage group grants" in Users and privileges.
In "Grant new privilege to group", select:
    everybody
    Normal access privilege for corpus [sacoco]
Click on "Grant privilege to group!".
Hurraaaaah! Corpus ready to be queried!
Let’s assume that you have administrator access to a CQPweb installation. We will guide you in the following lines through the process of setting up the corpus.
We need a single XML file containing all texts. texts2corpus.py helps us with this task.
texts2corpus.py:
    reads the .vrt files contained in the input folders
    extracts the <text> nodes
    appends the <text> nodes to a parent element called <corpus>
    saves <corpus> as a single XML file
Its usage is pretty simple: just provide the paths to the folders containing the .vrt files with metadata, and the path to the output file:
python3 texts2corpus.py -i data/contemporary/meta data/historical/meta -o data/sacoco.vrt
TIP: for development/testing purposes, if you just run python3 texts2corpus.py, it will work on the testing dataset stored in the test folder.
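To make the merging step more concrete, here is a minimal Python sketch of the idea behind texts2corpus.py. It is not the actual script: the function name merge_vrt is ours, and we assume that every .vrt file is well-formed XML with a <text> element as its root.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def merge_vrt(input_dirs, output_path):
    """Collect the <text> nodes of all .vrt files under one <corpus> element."""
    corpus = ET.Element("corpus")
    corpus.text = "\n"
    for folder in input_dirs:
        # Sort the files so that the output order is reproducible.
        for vrt_file in sorted(Path(folder).glob("*.vrt")):
            root = ET.parse(str(vrt_file)).getroot()
            # iter() also yields the root itself when it is a <text> element.
            for text in list(root.iter("text")):
                text.tail = "\n"
                corpus.append(text)
    ET.ElementTree(corpus).write(
        str(output_path), encoding="utf-8", xml_declaration=True
    )
    return len(corpus)  # number of <text> nodes written
```

With the folders used in this tutorial, a call could look like merge_vrt(["data/contemporary/meta", "data/historical/meta"], "data/sacoco.vrt"). Because the files are parsed and serialised again, entities such as &amp; are re-escaped on output.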
Go to admin control panel in the left-hand menu Account actions.
We need to upload the corpus file (sacoco.vrt) and the metadata file (sacoco.meta).
For each file:
    Go to "Upload a file" in the left menu Uploads.
    Click on "Choose File"; a dialogue window will open, where you pick the file you want to upload.
    Click on "Upload File".
We can now start installing the corpus:
sacoco
sacoco
Saarbrücken Cookbook Corpus
Select files section
    sacoco.vrt
S-attributes section
    Use custom setup
    p:0
P-attributes section
    Use custom setup
    Primary
Click on "Install corpus with settings above" at the bottom of the page.
A new page will load:
Click on "Design and insert a text-metadata table for the corpus".
A new page will load:
Select sacoco.meta in section "Choose the file containing the metadata".
Fill in "Describe the contents of the file you have selected", providing Handle and Description for each field.
Select "Yes please" in section "Do you want to automatically run frequency-list setup?".
Click on "install metadata table using the settings above".